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Abstract 

The  objective  of  this  research  was  to  demonstrate  that  classical  transport  equation  (RANS) 
models  can  be  applied  at  any  mesh  resolution.  In  particular,  we  show  that  transport  equation 
models  like  the  classical  k/e  model  also  make  excellent  subgrid  scale  models  for  Large  Eddy 
Simulation  (LES).  However,  this  research  is  not  concerned  with  the  development  of  a  particular 
RANS/LES  model  but  a  general  approach  to  turbulence  modeling  for  any  mesh  resolution.  To 
confirm  the  generality  of  the  approach,  a  Reynolds  stress  transport  (RST)  equation  model  is  also 
shown  to  work  well  as  an  automatically  adaptive  LES  subgrid  scale  model. 

Unlike  other  hybrid  modeling  approaches  that  can  address  a  range  of  mesh  scales,  the 
demonstrated  approach  is  self-adaptive.  It  will  always  calculate  using  first  principals  as  much  of 
the  turbulence  as  the  mesh  allows,  and  will  model  the  rest.  Unlike  other  hybrid  models,  this 
approach  is  not  a  blending  of  two  models,  nor  docs  it  require  additional  user  specified  constants. 
The  character  of  the  model  is  self-adjusting  and  is  not  a  function  of  some  external  input  such  as 
the  geometry.  The  approach  is  easy  to  implement  using  existing  CFD  codes. 

Turbulence  modeling  is  the  bottleneck  in  current  Computational  Fluid  Dynamics  (CFD) 
predictions  of  engineering  flows.  The  proposed  modeling  approach  is  fundamentally  different 
from  prior  LES  models  and  current  hybrid  models  in  that  it  achieves  a  completely  natural 
evolution  from  RANS  to  LES  to  DNS,  using  largely  existing  modeling  technology. 


1 


1.  Introduction 


Turbulence  models  are  frequent fy  classified  by  the  ratio  of  how  much  turbulent  energy  is 
represented  by  the  model  compared  to  how  much  turbulent  energy  is  computed  via  first 
principals.  RANS  (Reynolds  averaged  Navier-Stokes)  models  represent  the  most  turbulent 
energy  in  the  model.  LES  (large  eddy  simulation)  computes  considerably  more  of  the  turbulent 
energy  via  first  principals  and  DNS  (direct  numerical  simulation)  computes  all  the  turbulent 
energy  correctly  and  models  none.  If  a  more  detailed  terminology  is  desired,  in  between  RANS 
and  LES  lies  URANS  (unsteady  RANS)  and  VLES  (very  large  eddy  simulation).  The  range  of 
these  turbulence  models  is  shown  in  Figure  1  in  relation  to  a  3D  turbulent  energy  spectrum.  Each 
model,  tries  to  represent  the  energy  in  the  spectrum  to  the  right  of  the  model's  name. 


Figure  L  Illustration  of  turbulence  modeling  (RANS- DNS)  on  a  3D  energy  spectra. 


Classic  LES  sub  grid-scale  models  (like  Smagoriski  or  its  dynamic  model  varients)  are  algebraic 
in  nature  and  assume  that  equilibrium  is  present  between  the  unresolved  and  resolved  scales. 
Unfortunately,  this  equilibrium  is  actually  present  only  on  average.  The  motivation  for  using  a 
transport  equation  model  is  both  theoretical  and  practical  Theoretically,  a  transport  equation 
model  can  capture  the  inherently  non-equilibrium  subgrid-scale  turbulence  physics  more 
accurately.  Practically,  a  transport  equation  based  model  can  easily  transition  to  a  RANS  or 
URANS  model  at  coarse  mesh  resolutions.  In  low  Mach  number  flows,  the  solution  for  the 
pressure  dominates  the  calculation  lime,  and  the  solution  of  additional  transport  equations  are  not 
expected  to  incur  a  large  performance  penalty,  In  high  Mach  number  flows,  the  turbulence 
equations  can  be  advanced  less  frequently  and  with  a  larger  timestep,  and  the  conclusion  remains 
the  same. 

Classical  algebraic  LES  models  require  the  mesh  spacing  to  lie  in  the  self-similar  inertial  range  of 
the  energy  spectrum.  In  practical  applied  simulations  is  not  always  possible  to  ensure  that  the 
mesh  spacing  lies  in  some  inertial  or  self-similar  regime.  Near  walls,  there  is  no  inertial  range.  In 
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complex  flows  predicting  the  inertial  range  a  priori  is  practically  impossible.  A  model  that  can 
perform  at  any  mesh  resolution  will  not  be  restricted  by  this  current  LES  limitation  on  the  mesh 
size.  In  addition,  in  many  situations  it  is  wasteful  to  perform  LES  in  regions  of  the  computation 
(flat  plate  boundary  layers  and  simple  mixing  layers)  where  RANS  models  arc  known  to  perform 
quite  well.  LES  becomes  more  useful  if  the  model  can  function  in  the  RANS  limit  for  those 
regions  of  the  flow  that  are  not  too  complex.  Classic  LES  models  cannot  function  in  the  RANS 
(coarse  mesh)  or  DNS  (very  fine  mesh)  limits  because  the  mesh  size  is  no  longer  a  useful 
indicator  of  the  turbulent  length  scales  (as  it  is  in  the  inertial  range). 

There  are  a  number  of  existing  hybrid  turbulence  models  that  are  designed  to  be  able  to  address 
these  issues  and  perform  over  a  broad  range  of  the  mesh  spectrum  (i.e.  URANS  down  through 
LES).  For  example,  the  very  popular  DBS  (detached  eddy  simulation)  model19  behaves  like  a 
RANS  or  URANS  model  near  walls,  but  away  from  walls  the  lengthscale  is  changed  to  the  mesh 
size  and  the  model  has  an  LES  character.  Other  hybrid  models  do  not  change  their  character 
based  on  location  relative  to  a  wall,  but  on  whether  the  mesh  is  much  smaller  than  the  energy 
containing  turbulence  scales  (leading  to  LES)  or  not  (leading  to  RANS  or  URANS).  The  earliest 
implementation  of  such  an  approach  Speziale20  used  classic  LES  (Smagorinsky)  and  RANS 
( k/e)  models  to  solve  for  both  an  LES  and  a  RANS  eddy  viscosity  and  then  blended  these  two 
viscosities  together  based  on  a  function  of  the  mesh  size.  Girimaji10  has  developed  a  hybrid 
model  (PANS)  that  can  change  its  character  based  on  input  from  the  user  (the  user  sets  the 
desired  ratio  of  modeled  turbulent  kinetic  energy).  The  results  presented  in  the  following  report 
can  not  be  obtained  with  any  of  these  previous  approaches. 

In  this  research,  a  self-adapting  turbulence  modeling  approach  was  developed  that  works  for  any 
mesh  resolution  and  over  the  entire  energy  spectrum.  It  can  therefore  do,  RANS,  URANS,  VLES, 
LF-S  and  even  DNS.  More  importantly,  the  character  of  the  model  is  not  set  by  the  user  (like 
PANS)  or  geometric  location  (like  DES),  but  instead  will  adapt  to  whatever  level  the  mesh  can 
support.  The  proposed  approach  therefore  models  only  as  much  kinetic  energy  is  necessary  (for  a 
given  mesh)  and  resolves  as  much  of  the  energy  using  first  principals  as  possible.  It  is  not 
technically  correct  to  consider  the  proposed  approach  to  be  a  hybrid  model  in  the  classic  sense 
(though  it  has  many  similarities  to  those  models)  because  it  does  not  blend  an  LES  and  a  RANS 
model  together.  The  proposed  approach  is  a  ‘universal’  modeling  approach,  It  is  a  single  model 
that  works  naturally  at  any  mesh  resolution. 


2.  A  Universal  Model 

The  classic  mathematical  theory  behind  RANS  and  LES  makes  these  two  modeling  approaches 
look  fundamentally  different.  RANS  is  based  on  ensemble  averages  and  LES  on  filtering.  At  first 
glance,  (he  possibility  of  a  single  model  that  does  both  (without  some  sort  of  switch  or  blending 
function)  seems  remote.  However,  a  closer  examination  by  Germano*  revealed  some  very 
important  insights.  Most  importantly,  the  exact  but  unclosed  governing  equations  for  RANS  and 
LES  (and  URANS,  VLES  and  DNS)  are  all  mathematically  identical.  While  the  RANS 
equations  can  be  derived  from  the  assumption  of  ensemble  averaging  and  the  LES  equations 
from  filtering  operations,  these  assumptions  are  overly  restrictive  and  neither  system  must  he 
derived  with  those  assumptions.  The  only  required  assumption  is  that  the  velocity  field  can  be 
split  into  two  parts  and  that  this  splitting  operation  commutes  with  differentiation,  With  this 
assumption  the  equations  for  turbulence  evolution  are  (from  Appendix  A), 
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turbulent  stress  tensor.  The  exact  (but  unclosed)  evolution  equation  for  this  stress  tensor  is, 
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where  the  bracket  operation  is  given  by  <a,,bJ  >xajb  —atb  and  the  turbulent  transport  is 
defined  by  T.jk  =  u,ujui  -utRlk  -  «  /?.,  ~utRfl  -ulu/ui .  Note  that  the  stress  tensor  can  also  be 
defined  using  the  bracket  notation,  Rv  =<  ul,ul  >  .  The  turbulent  transport  and  bracketed  terms 

require  a  model  if  the  system  is  to  be  solved.  In  RANS  the  overbar  might  denote  an  ensemble 
average.  In  LES  the  overbar  might  be  an  explicit  filtering  operation.  However,  it  can  also  be  an 
implicit  operation,  because  in  practice  when  these  equations  are  modeled  and  then  solved  on  a 
computer,  the  overbar  operation  is  never  actually  performed.  In  this  case,  it  is  assumed  that  an 
overbar  represents  whatever  the  calculation  computes.  It  is  not  possible  to  prove  that  an  implicit 
filter  commutes  with  differentiation  but  it  is  a  fairly  reasonable  assumption  to  make  at  least  to 
first  order. 

To  accurately  model  some  of  the  terms  in  (l)  and  (2),  a  third  'scale'  equation  is  often  postulated 
that  captures  the  turbulent  energetic  length  or  timescale.  The  epsilon  equation13  is  perhaps  the 
most  commonly  used  scale  equation,  but  omega  (inverse  timescale)23  and  lengthscale 
equations15,18  also  are  possible  and  can  be  advantageous.  In  theory,  these  scale  equations  can  be 
derived  from  first  principals  but  the  number  of  modeling  assumptions,  in  practice,  makes  them 
largely  empirical. 

Starting  from  these  exact  equations  numerous  modeling  approaches  are  possible.  In  order  of 
increasing  simplicity  a  brief  description  is  given  for  some  of  the  more  common  approaches. 
Reynolds  stress  transport  (RST)  models  use  a  modeled  form  of  the  tensor  equation  (2).  The  more 
popular  two-equation  RANS  models  (such  as  k/e  or  klto)  are  a  simplification  of  equation  (2) 
from  a  tensor  equation  to  a  scalar  equation  (by  taking  its  trace).  The  primary  unknown  must  then 
be  reconstructed  from  this  scalar  kinetic  energy,  k  ,  using  a  hypothesized  algebraic  relation  such 
as  the  eddy  viscosity  hypothesis.  One-equation  models,  such  as  the  Spalart-Allmaras19  model 
used  as  the  basis  for  DES30  solve  a  single  transport  equation  directly  for  the  eddy  viscosity  and 
use  an  algebraic  expression  for  the  lengthscale  (such  as  the  distance  to  the  wall).  Classic  LES 
models  (such  as  Smagorinsky  and  its  variants)  are  the  simplest  models  of  all,  they  solve  no 
transport  equations  for  the  turbulence  variables  but  algebraically  obtain  the  necessary  length  and 
timescales  from  the  mesh  size  and  the  resolved  velocity  gradients. 
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Traditionally,  LES  models  have  used  the  very  simplest  modeling  approach  because  of  the  issue  of 
cost.  Early  attempts  at  using  more  complex  transport  equations  were  deemed  not  worth  the  extra 
effort.  Deardorff7  used  a  RST  model  and  Schumann17  used  a  kls  model.  Models  and  computing 
power  have  changed  considerably  in  the  30  years  since  those  first  simulations.  Some  more  recent 
LES  models  now  cany  a  single  transport  equation.  This  is  certainly  the  case  in  DES,  and  a  kinetic 
energy  transport  equation  was  used  by  Ghosal  e,t.  aL9.  As  the  mathematical  analysis  of  Gcrmano* 
makes  clear,  there  is  no  fundamental  reason  why  these  more  complex  modeling  approaches  (used 
currently  only  by  RANS  models)  cannot  also  be  applied  to  LES.  The  apparent  natural  evolution 
of  turbulence  models  to  include  more  physics  and  therefore  more  complexity,  suggests  that  two- 
equation  and  RST  transport  models  for  LES  are,  in  fact,  the  next  logical  step.  The  ‘universal’  k/e 
model  and  its  results  are  presented  first  (section  3),  and  the  RST  model  and  its  performance  are 
present  in  section  7, 


3,  A  Two-Equation  LES  Model 

The  classical  kf£  equation  system  is  used  in  this  first  section  in  order  to  reach  the  largest 
audience  possible.  There  are  very  good  reasons  to  prefer  other  two-equation  model  systems. 
However,  since  all  two-equation  transport  models  are  closely  related,  the  proposed  modeling 
ideas  can  be  easily  generalized  to  these  other  transport  equation  models.  It  should  be  very  easy  to 
implement  the  model  presented  in  this  section  into  existing  commercial  and  industrial  CFD 
codes. 

The  unclosed  equations  (1 )  and  (2)  will  be  modeled  using  the  following  transport  equations. 


d«  d  d(p  +  -/c)  d  d  3 

(3) 

CS>=/  ir  Vs*  ]+«/>-« 

dl  dxt  dx,  ci,  dxt 

(4) 

de  d  ,  .  d  t/v  +  vT.  df .  £„  _  _  . 
a/+,  =  5  K  ]+,EC'^  C'i£} 

dl  dxt  dxt  af  dx/  k 

(5) 

where  the  overbar  on  the  velocity  and  pressure  are  now  dropped  for  convenience.  The 

k7  k 

production  is  given  by  P  =  vT{ut  +  u  )*/,  and  eddy  viscosity  is  given  by  vT  =  C  — ( - — ) . 

£  k  +  kr 

II  is  now  necessary  to  distinguish  between  the  modeled  (or  unresolved)  turbulent  kinetic  energy, 
k,  and  the  resolved  kinetic  energy,  kr  ~ +  WjW2  +  u}u}) ,  which  is  calculated  from  the 
resolved  velocity. 

The  constants  are  fairly  standard  k/e  constants,  Cfi~  1.55,  Ot  -  1.2 ,  <7 t  =1.0,  Ctt  =  0, 1 8 .  The 
11  25 

parameter  Ct2  =  —  /  +  —  f2  is  sensitive  to  the  local  unresolved  turbulent  Reynolds  number 
6  Rer 
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ReT  =  —  of  the  modeled  turbulence  via  the  function  f  =  jl  +-^ — 1)  as  per  the  analysis 

V£  30  ]j  ReT 

of  Perot  and  de  Bruyn  K.opsM.  This  varies  Ce2  from  its  theoretical  limits  of  11/6  at  high 

Reynolds  numbers  to  3/2  at  low  Reynolds  numbers.  Any  Reynolds  number  dependent  Ct2 

would  probably  be  sufficient.  Reynolds  number  dependence  is  important  in  LES  because  the 

effective  turbulent  Reynolds  number  ReT  =  kllv£  becomes  small  as  the  mesh  is  refined  (and 

goes  towards  zero  for  DNS).  For  incompressible  flow,  the  pressure  in  equation  (3)  is  determined 

from  the  incompressibility  constraint,  u  t  =  0 . 

The  two  major  differences  in  this  equation  system  from  the  classical  k/c  model  is  the  definition  of 
the  eddy  viscosity  which  now  contains  the  kinetic  energy  ration,  and  the  presence  of  the 
additional  parameter  a  which  is  discussed  in  the  next  section. 

3.1  Backscatter  of  Energy 

One  key  component  of  a  self-adaptive  turbulence  model  is  that  it  must  be  able  to  backscatter 
energy  from  the  unresolved  (modeled)  turbulence  to  the  calculated  (resolved)  velocity  field.  For 
example,  a  fine  grid  (128J )  simulation  of  isotropic  turbulence  would  resolve  most  of  the  energy 
and  very  little  would  need  to  be  modeled.  If  this  simulation  were  set  up  incorrectly,  and  most  of 
the  energy  were  defined  to  be  modeled  and  very  little  resolved  (i.e.  initialized  from  a  RANS 
simulation),  the  adaptive  turbulence  model  ought  to  remove  energy  from  the  modeled  kinetic 
energy  and  energize  the  resolved  velocity  field  to  correct  this  error.  Figure  2  illustrates  the 
behavior  of  backscatter  (and  the  physically  more  dominant  -  forward  scatter)  for  a  id  energy 
spectra,  i  lere  the  resolved  turbulence  is  on  the  left  and  the  modeled  turbulence  is  to  the  right 
(shaded),  the  arrows  indicate  the  direction  of  energy  flow. 


Figure  2,  Illustration  of  backscatter  and  forward  scatter  on  a  ID  energy  spectra. 
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With  the  ability  to  backscatter,  a  turbulence  model  can  automatically  correct  errors  in  the  initial 
conditions,  ami  most  importantly,  perform  model  adaptation  without  user  intervention*  The  idea 
of  allowing  backscatter  in  a  turbulence  model  is  not  a  new  one.  It  has  been  shown  by  Chasnov3 
and  Carati1  ct  a!.,  that  the  -5/3  power  taw  in  isotropic  decay  is  better  predicted  by  LBS 
(dynamic)  models  that  account  for  backscatter*  Along  similar  lines,  classical  DES  can  not 
backscatter  energy  but  it  has  been  recently  shown  by  Piomelli  el  al,*6  that  adding  noise,  a  crude 
form  of  energy  backscatter,  to  the  DES  model  improved  results  for  channel  flow. 


If  a  -  t ,  the  propsed  system  (Equations  (3),  (4),  (5))  is  very  close  to  a  classic  UIe  modet.  The 
proposed  formulation  assumes  that  the  turbulent  stress  tensor  is  reconstructed  using  the  eddy 


viscosity  hypothesis,  Rtf  -  -  kdtj  -vTa(uf  f  +ttfi 


).  This  of  course  is  the  simplest  reconstruction 


hypothesis  to  use.  Nonlinear  eddy  viscosity  and  algebraic  Reynolds  stress  models  use  more 
complex  (but  still  algebraic)  reconstructions.  Again,  the  proposed  ideas  can  easily  be  generalized 
to  that  context  as  well.  The  simplest  model  possible  is  used  in  this  work  in  order  to  focus  as 
directly  as  possible  on  the  key  idea  -  that  it  is  possible  to  develop  turbulence  models  that 
automatically  work  at  any  mesh  resolution* 


RST  models  based  on  equation  (2)  can,  and  do,  backscatter  energy.  We  will  show  this  effect  later. 
But  RST  models  are  more  complex  and  less  familiar,  to  we  wish  to  enable  the  two-equation 
models  to  perform  as  universal  models,  and  this  will  require  adding  backscatter.  The  classical 
two-equation  models  are  based  on  a  positive  definite  eddy  viscosity  which  is  too  simplified  an 
assumption  (eddy  viscosity  should  actually  only  be  positive  on  average)  and  therefore  classical 
two-equation  models  can  not  backscatter  energy.  The  eddy  viscosity  is  always  positive  and 
therefore  the  model  always  removes  energy  from  the  mean  flow  and  puts  it  into  the  modeled 
kinetic  energy,  k  ,  where  it  is  eventually  dissipated  by  e  to  heat  It  can  never  move  energy  from 
the  model  to  the  resolved  scales.  This  is  the  correct  average  behavior  for  a  turbulence  model,  but 
not  necessarily  locally  correct  (in  space  or  time). 


The  additional  parameter  a  has  been  added  to  the  classic  kfe  model  to  correct  this  important 
flaw  and  control  the  energy  flow.  Usually  a  is  positive  (and  order  1),  but  it  can  become  small  or 
even  negative.  When  a<  0  the  model  is  backscattering  energy.  This  parameter  is  not  a  model 
constant,  instead  it  is  a  field  that  varies  in  space  and  time,  so  that  backscatter  can  happen  in 
different  regions  at  different  times*  The  transfer  variable,  a ,  also  appears  in  Equation  (4),  the  k- 
equation,  so  that  the  total  kinetic  energy  (modeled  k  plus  resolved  kr )  is  a  conserved  quantity 
and  can  only  disappear  via  dissipation  to  heat.  Its  presence  is  not  necessary  in  the  scale  equation. 
Equation  (4),  so  wc  do  not  include  it. 

When  a  <  0 ,  the  eddy  viscosity  in  (4)  is  essentially  negative.  Negative  viscosity  is  anti-diffusive, 
it  amplifies  (rather  than  damps)  existing  resolved  velocity  fluctuations.  It  amplifies  small 
wavelength  modes  (those  closest  to  the  mesh  resolution)  the  most  rapidly.  This  is  a  very 
reasonable  model  for  backscatter  It  is  not  injecting  energy  via  some  random  forcing  of  the 
resolved  flow,  rather  it  works  to  enhance  the  existing  instabilities  and  modes.  Moreover  the 
energy  transfer  is  local  in  spectral  space.  It  tends  to  take  energy  from  the  model  (which  has  most 
of  its  energy  at  scales  just  below  the  mesh  resolution)  and  preferentially  delivers  it  to  the  resolved 
flow  at  almost  the  same  length  scale  (but  just  above  the  mesh  resolution). 
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The  model  for  a  the  energy  flow  parameter,  uses  ideas  from  enor  estimation  for  mesh 
adaptation.  If  the  error  in  the  computed  solution  is  small,  then  the  mesh  is  deemed  to  be  sufficient 
for  first  principals  simulation  and  the  model  should  go  away  (the  modeled  kinetic  energy  should 
become  very  small).  This  is  akin  to  the  situation  described  earlier,  where  a  128J  grid  is  initialized 
with  a  RANS  solution  and  some  small  resolved  flow  fluctuations.  The  model  will  detect  this 
situation  as  very  highly  resolved  (lots  of  mesh  resolution).  Due  to  energy  conservation  inherent  in 
Equations  (3)  and  (4),  the  modeled  kinetic  energy  can  not  just  disappear.  In  order  to  become 
small  the  modeled  kinetic  energy  must  be  transferred  somewhere  -  the  only  possibility  is  to  the 
resolved  flow.  Small  estimated  errors  (in  the  resolved  quantities)  implies  there  should  be  energy 
backscatter  (the  mesh  can  handle  more  fluctuations  via  direct  simulation).  So  small  errors  result 
in  a <0. 

At  some  point  later  in  time,  the  resolved  velocity  on  the  I283  mesh  wilt  be  fully  energized,  If  the 
Reynolds  number  is  high,  then  a  1283  mesh  is  still  not  sufficient  to  perform  DNS  (a  model  is  still 
necessary).  In  this  case  the  error  estimate  will  become  larger  and  larger  until  a  becomes  positive 
and  a  normal  forward  energy  scatter  down  the  energy  cascade  will  be  imposed.  The  larger  the 
error,  the  more  energy  is  fed  to  the  modeled  kinetic  energy  and  the  more  the  model  influences  the 
resolved  flow  evolution.  If  the  Reynolds  number  is  low  enough,  then  a  1 28J  mesh  may  actually 
be  sufficient  resolution  for  DNS  and  the  error  estimate  stays  smalt.  In  this  case  the  model 
continues  to  backscatter  until  the  model  has  almost  no  energy.  At  this  point  the  model  has  no 
affect  on  the  resolved  modes  and  DNS  is  achieved. 


3.2  Energy  Transfer  Variable 

The  proposed  equation  for  controlling  the  energy  transfer  is, 

a  - 1  .S(1.0-C'(-T— )![  +0.1  IF') 

k+kr  Jkr  dx, 


(6) 


where  kr  is  the  resolved  kinetic  energy  (at  a  certain  location  and  time),  k  is  the  modeled  kinetic 
energy,  and  C*=0.28.  The  quantity 

(Ax,  ~^~)2/k,  =  ((Ax d^*L);  +  (Ay  )-  +  is  a  dimensionless  measure  of 

axt  ax  ay  az 

the  error  (similar  to  what  is  sometimes  used  in  mesh  adaptation).  In  this  formulation  the  resolved 


kinetic  energy  kr  =  -^(i/13  +u:  +H.i)  *s  l^e  indicator  function  that  is  being  used  to  estimate  the 


mesh  resolution.  If  the  flow  is  DNS  or  over-resolved  (like  a  RANS  initial  condition  on  an  LES 
mesh)  then  this  quantity  is  small,  its  inverse  is  large  (but  limited  away  from  infinity  by  the  fairly 
arbitrary  0.11  term),  and  the  model  tends  to  backscatter  energy.  In  contrast,  normal  energy 
transfer  (from  resolved  scales  to  the  modeled  scales)  occurs  in  the  regions  of  the  flow  where  the 
gradient  length  scales  are  comparable  to  the  mesh  size.  On  very  coarse  meshes,  RANS  like 
behavior  should  be  recovered.  In  this  limit,  kr  is  expected  to  be  very  smalt,  but  note  that 
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(  Axt  ■  r  )2/kr  will  remain  finite  and  independent  of  the  magnitude  of  the  resolved  fluctuations, 

3*, 

For  the  simulations  of  isotropic  turbulence  performed,  this  tenn  obtains  an  average  value  of  0.8  in 

the  RANS  limit.  This  means  that  a— >1.5- (142/(0.8  +  0. 1 1 )  =  1.04  and  the  standard  RANK 
model  is  very  closely  recovered  in  the  RANS  limit. 

The  particular  form  of  the  energy  transfer  function  was  developed  largely  by  intuition  and  tuned 
solely  to  obtain  the  correct  limits.  Many  other  functional  expressions  and/or  indicator  quantities 
are  certainly  possible.  The  goal  of  this  work  is  not  to  advocate  for  this  particular  function  but  to 
demonstrate  that  self-adaptive  turbulence  models  are  possible,  and  this  particular  function  serves 
this  purpose  adequately. 


3-3  Numerical  Method 

Simulations  of  isotropic  turbulence  are  frequently  performed  with  Fourier  spectral  methods  that 
use  discrete  FFTs  to  solve  the  pressure  Poisson  equation.  Spectral  methods  are  not  indicative  of 
what  is  used  in  commercial  codes.  On  the  other  hand,  commercial  codes  tend  to  have  excessive 
numerical  dissipation  (because  high  robustness  is  sought).  This  numerical  dissipation  can  lead  to 
poor  results  -  even  in  the  DNS  cases  that  do  not  involve  the  model.  This  work  compromised 
between  the  two  extremes  by  using  a  2nd  order  Cartesian  staggered  mesh  method1 1  (see  Figure  3) 
with  exact  projection2  for  the  pressure  solution.  This  is  essentially  a  low  order  finite  volume 
method  (like  the  commercial  codes)  that  conserves  mass,  energy,  and  vorticity  properly  (like  the 
spectral  methods).  The  code  is  fully  parallel  (using  MPl)  and  optimized  for  execution  on  PC 
clusters. 


Figure  3.  Fully  staggered  arrangement  of  velocity  (u,  v)  and  pressure  (p).  The 
actual  simulation  is  three  dimensional 


4.  Isotropic  Decaying  Turbulence 

We  wish  to  demonstrate  the  validity  of  the  approach  using  one  of  the  simplest  turbulence  models 
(k/c)  and  one  of  the  simplest  turbulent  flows  possible  (decaying  turbulence).  This  is  intentionally 
done  so  there  is  no  room  for  ambiguity.  Neither  the  complexity  of  the  problem  nor  the  model  can 
obfuscate  the  results. 

It  should  be  noted  that  other  hybrid  methods,  such  as  DES,  do  not  work  well  for  this  very  simple 
problem.  In  the  absence  of  walls  DES  always  does  LES,  and  when  the  mesh  becomes  too  coarse 
it  will  fail  like  any  other  classical  LES  model.  PANS  will  not  adapt  as  the  turbulence  decays,  so 
when  the  turbulence  has  decayed  into  the  DNS  limit,  PANS  will  still  be  applying  an  LES  eddy 
viscosity. 

Isotropic  decaying  turbulence  was  calculated  using  periodic  boundary  conditions  on  a  box  of  size 
of  18JFJtl8/rx36/T,  This  initial  box  size  was  used  in  order  to  compare  our  results  with  an 
independently  performed  DNS  simulation  of  this  flow.  Our  mesh  size  will  always  be  double  in 
the  third  direction  so  that  the  actual  mesh  spacing  is  always  equal  in  all  three  directions. 

Figure  4  shows  a  mid-plane  slice  of  an  initial  condition  256x256x512  isotropic  field  for  reader 
visualization.  It  is  apparent  from  figure  4  that  the  256x256x512  initial  condition  is  turbulent  and 
isotropic  (no  preferential  direction).  While  the  following  results  focus  on  isotropic  turbulence 
because  of  its  simplicity,  it  must  be  stressed  that  this  methodology  can  easily  be  applied  to  any 
flow  field,  and  makes  no  assumptions  about  the  problem  being  solved. 


Figure  4:  Mid -plane  slice  (x-y  plane)  through  initial  256x256x512  isotropic  field,  Re  =  640. 
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4.  1  High  Re  Isotropic  Decay 


in  isotropic  turbulence,  the  turbulent  kinetic  energy  decays  with  time.  This  process  can  be 
simulated  with  a  simple  RANS  model  using  only  one  spatial  grid  cell,  with  LES  using  many 
cells,  or  with  DNS  using  enough  cells  to  resolve  the  smallest  length  scales  of  motion.  Results 
from  such  a  test  are  shown  in  Figure  5.  The  ordinate  shows  total  kinetic  energy  (k  +  kr) 
normalized  by  the  initial  total  kinetic  energy  at  time  t=0.0,  the  abscissa  r ,  is  given  by  simulation 

£ 

time  (seconds)  non-dimensionalized  by  the  inverse  time  scale  (  — )  at  time  t=G.O  (where 

K 

indicates  total  quantities).  Normalization  of  the  abscissa  can  be  interpreted  as  'eddy1  turn  over 
times. 


Tau 


Figure  5.  Total  kinetic  energy  predictions  of  isotropic  decay  at  initial  Re=640, 


The  initial  turbulent  Reynolds  number,  Rer  =  — —  ,  for  this  test  case  is  640,  comparable  to  the 

V£t 

Comte- Bel  lot  and  Corrsin  1971  experiment"1.  The  DNS  result  was  performed  independently  by  dc 
Bruyn  Kops^  on  a  768x768x1536  mesh  using  a  Fourier  spectral  method  and  is  given  by  the  large 
circles.  Many  aspects  of  the  DNS  (including  spectra)  have  been  closely  compared  with  Comte- 
Bellot  and  Corrsin  and  shown*"  to  agree  well.  The  mesh  resolution  is  identical  in  each  direction, 
with  the  box  size  in  the  z-dircclion  twice  as  large. 


A  number  of  different  simulations  were  performed  using  mesh  resolutions  from  1x1x2  to 
256x256x512.  In  each  case  the  model  is  identical  and  only  the  mesh  and  initial  conditions  are 
changed.  The  initial  conditions  are  obtained  from  the  full  DNS.  The  DNS  initial  condition  is  not 
random  Fourier  modes  -  but  full  Navier-Stokcs  turbulence  that  was  obtained  by  running  for  a 
long  time  with  as  little  forcing  of  the  large  scales  as  possible,  to  maintain  the  kinetic  energy. 

Each  model  initial  condition  was  formed  by  averaging  the  same  initial  768x768x1536  velocity 
field  to  the  appropriate  mesh  size  using  a  simple  box  average.  The  initial  modeled  kinetic  energy 
at  each  mesh  location  was  then  determined  by  comparing  the  exact  768x768x1536  velocity  field 
to  the  smoother  coarse  mesh  field  and  calculating  the  sum  of  its  difference.  The  modeled 
dissipation  was  calculated  similarly,  by  using  a  box  average  of  the  exact  DNS  dissipation  field, 
and  comparing  to  the  known  total  dissipation  at  time  zero.  Because  dissiaption  tends  to  occur  at 
the  smallest  scales,  the  modeled  dissipation  is  very  close  to  the  total  dissipation  (very  little 
dissipation  is  resolved)  except  for  the  very  finest  mesh  (256x256x512). 

The  results  in  Figure  5  are  boring  to  look  at  but  profound.  They  show  that  at  any  mesh 
resolution,  the  model  predicts  the  decay  of  the  turbulence  accurately.  The  very  smallest  mesh 
resolution  is  clearly  a  RANS  simulation  and  the  largest  mesh,  256x256x512,  is  an  LES 
(bordering  on  DNS)  simulation.  The  intermediate  resolutions  might  be  considered  URANS, 
VLES,  or  LES.  The  spectra  for  the  initial  conditions  arc  shown  in  Figure  6.  These  spectra 
represent  VLES  (32x32x64),  LES  (64x64x128,  128x128x256),  and  LES/DNS  (256x256x512) 
simulations.  The  box  average  of  the  initial  data  takes  some  energy  from  the  lowest  wavenumbers 
but  most  of  the  energy  from  the  highest  wavenumbers.  All  energy  that  is  not  resolved  is  modeled 
by  the  spatially  and  temporally  varying  variable  k  , 


Figure  6.  Spectrum  of  initial  conditions  (32x32x64,  64x64x  1 28,  1 28x  1 28x256,  256x256x5 1 2) 
for  (Re=640)  isotropic  decaying  turbulence. 
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4:2  Kinetic  Energy  Ratio 


It  has  been  hypothesized  that  RANS  equations  must  give  RANS  (essentially  the  1x1x2)  results 
irrespective  of  the  mesh  resolution  used  to  solve  those  equations.  This  would  indeed  be  the  case, 
if  the  equations  were  linear.  Here  we  clearly  show  that  in  fact,  different  meshes  produces 
different  solutions  to  this  coupled  equation  system  (even  though  the  total  behavior,  figure  5, 
remains  the  same  for  all  meshes). 


0  \  2  3  A  5  6  7  8  9  10  71  12  13 

Tag 


Figure  7.  Ratio  of modeled  kinetic  energy  to  total  kinetic  energy,  Rc=640. 


The  ratio  of  the  modeled  kinetic  energy  to  the  total  kinetic  energy  is  shown  in  Figure  7  with  one 
curve  for  each  of  the  mesh  resolutions.  The  1x1x2  solution  is  the  lop  curve,  with  all  its  energy 
contained  in  the  model  (giving  a  ratio  of  1 .0)  and  the  256x256x512  simulation  is  the  bottom  line, 
with  the  smallest  ratio  of  modeled  kinetic  energy  (<10  percent).  Note  that  these  curves  are 
relatively  constant  and  decrease  slightly  with  time  as  the  simulation  proceeds.  They  do  not 
approach  each  other,  or  the  RANS  limit  (1.0).  Each  mesh  resolution  is  operating  with  a  different 
average  value  for  the  modeled  kinetic  energy.  Those  resolutions  that  require  less  modeling  of  the 
energy  start  with  less  model  energy  and  maintain  that  lower  level. 

The  LES  solutions  stay  entirely  unsteady  and  three-dimensional.  The  slight  decrease  in  the  ratio 
over  time  is  the  coned  behavior.  It  is  due  to  the  fact  that  over  time  the  Reynolds  number  of  the 
(low  is  slowly  decreasing  and  the  mesh  can,  and  therefore  does,  resolve  a  larger  percentage  of  the 
turbulent  fluctuations.  If  the  simulation  was  run  long  enough,  then  the  Reynolds  number  would 
drop  sufficiently  for  the  256x256x512  mesh  to  perform  DNS,  and  at  this  point  the  ratio  should  be 
essentially  zero.  The  small  variation  at  the  beginning  of  each  simulation  shows  the  initial 
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condition  is  close  to  the  correct  ratio  supportable  by  that  mesh,  but  not  perfect.  The  long  time 
behavior  of  this  ratio  is  shown  in  Figure  8,  These  decay  rates  are  consistent  with  the  Reynolds 
number  decay  rale  (Re  -  f l/5). 


Figure  8.  Ratio  of  modeled  kinetic  energy  to  total  kinetic  energy  for  extended  time*  Re=640, 


Figure  8  shows  that  the  coarse  mesh  (4x4x8)  might  approaching  the  'RANSf  solution  (a  ratio  of 
1 .0)  at  very  long  times.  When  the  mesh  is  larger  than  the  largest  eddies  this  is  possible.  The  finer 
meshes  (8x8x16  -  128x128x256)  have  enough  resolution  to  compute  a  solution  that  does  not 
return  to  the  RANS  limit.  It  is  hypothesized  that  steady  (RANS-like)  solutions  occur  when  the 
resolution  is  not  sufficient  to  maintain  an  energy  cascade.  For  this  lest  case,  the  initial  large  eddy 
lengthscale  of  the  turbulence  ( L  —  km!e )  is  6.0,  and  the  8x8x16  simulation  has  a  mesh  size  close 
to  7.0.  The  4x4x8  simulation  therefore  has  a  mesh  size  that  is  twice  the  size  of  the  energy 
containing  eddies.  It  can  not  resolve  many  eddies,  and  can  not  set  up  a  nonlinear  energy  cascade. 
Nikitin  et  al.u  has  noted  that  a  similar  affect  occurs  in  very  under-resolved  DES  simulations. 

4.  3  Perturbation  of  Initial  Conditions 

A  truly  adaptive  model  should  be  able  to  obtain  the  correct  behavior  from  incorrect  initial 
conditions.  For  example,  it  is  of  considerable  interest  to  see  if  a  64x64x128  mesh  initialized  with 
a  RANS  solution  can,  over  time,  develop  into  a  full  LES  simulation.  In  order  to  test  the  model  in 
this  way,  the  initial  conditions  were  either  smoothed  or  sharpened  using  a  filtering  operation.  The 
filter  used  to  alter  the  initial  conditions  was  a  nearest  neighbor  averaging  procedure, 

=  /H*  +  ( 1  -  /?>("», ,k  +",-u*  +  Vi*  +«,/-!*  +  *Vi  +  *V-i >7  ■  For  smoothing,  /?  =  0  was 

o 

used.  This  replaces  the  value  at  a  mesh  point  by  the  average  of  its  nearest  neighbors.  This  type  of 
filter  removes  energy  primarily  from  the  highly  oscillatory  modes  with  wavelengths  close  to  the 
mesh  size.  In  spectral  terms  it  damps  the  spectra  in  the  region  just  above  the  cutoff  wave  number. 


14 


The  affect  is  shown  in  Figure  9,  which  shows  the  original  initial  spectra  for  the  64x64x128 
simulation,  and  the  spectra  for  the  smoothed  and  sharpened  initial  conditions.  Sharpening  the 
velocity  field  is  performed  by  using  /?  =  1.5.  This  adds  energy  to  the  existing  high  frequency 
modes. 


Figure  9.  ID  energy  spectra  for  64x64x128  case,  Re=640  (Sharp,  Normal  Smooth). 


Figure  10  shows  the  affect  of  smoothing  and  sharpening  the  initial  conditions  on  the  kinetic 
energy  ratio.  When  smoothing  is  used,  energy  is  removed  from  the  resolved  inodes.  In  order  to 
keep  the  total  kinetic  energy  the  same,  the  model  now  must  start  with  more  modeled  kinetic 
energy.  The  ratio  therefore  starts  higher  than  before.  We  would  like  this  model  energy  to 
backscatter  into  the  resolved  velocity  field,  since  the  mesh  can  in  fact  support  more  turbulence 
than  was  introduced  at  the  initial  condition. 

Note  that  as  time  proceeds  the  model  achieves  the  same  ratio  irrespective  of  the  initial  conditions. 
At  early  times,  the  smoothed  solution  has  less  error  and  therefore  baekscatters  somewhat  more 
than  the  unperturbed  initial  condition.  This  removes  energy  from  the  model  and  makes  the  ratio 
decrease  faster,  so  that  it  approaches  its  original  state.  A  similar  (but  opposite)  process  happens 
when  the  spectra  is  sharpened.  In  this  case,  the  model  senses  that  the  mesh  can  not  support  the 
input  resolved  fluctuations,  and  quickly  sends  that  energy  to  the  model  (by  damping  the  resolved 
velocities).  Note  that  the  rate  at  which  the  model  adjusts  to  perturbed  initial  conditions  depends 
on  the  mesh  resolution.  The  higher  mesh  resolutions  adjust  more  rapidly.  It  is  hypothesized  that 
the  time  it  takes  to  transfer  the  energy  scales  on  the  timescale  of  the  turbulence  at  the  cutoff 
(transfer)  lengthscale  (k/£). 
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Figure  10,  Ratio  of  modeled  kinetic  energy  to  total  kinetic  energy,  Re-640,  for  perturbed  initial 

conditions. 


4,4  Low  Rc  Isotropic  Decay 

In  order  to  verify  that  the  model  works  in  the  DNS  regime,  a  lower  Reynolds  number  simulation 
was  performed  which  is  well  resolved  (DNS)  at  the  largest  simulation  of  256x256x512.  The 
initial  condition  for  this  lower  Reynolds  number  case  was  generated  by  running  the  high  Re 
simulation  with  a  larger  viscosity  {v  =  0.3743  cm2 is)  and  with  a  very  small  amount  of  modeled 
kinetic  energy  ( k  =  ICT*cjh2/s2),  Using  the  higher  viscosity  reduces  the  Reynolds  number  and 
also  quickly  damps  the  smallest  scales  of  the  turbulence. 

The  turbulent  time  scale  ,  is  plotted  versus  simulation  time  (solid  line)  in  Figure  1 1  for  the 

low  Re  case.  As  was  expected,  at  early  times  the  turbulence  is  adjusting  and  the  curve  is  not 
linear.  This  is  due  to  the  simulation  reacting  to  the  increased  viscosity  and  reduced  modeled 
kinetic  energy.  Eventually,  a  power  law  decay  for  the  kinetic  energy  is  obtained,  and  the 
simulation  is  considered  fully  adjusted.  A  power  law  decay  should  appear  as  a  straight  line  in 
Figure  1 1.  A  linear  curve  fit  is  given  by  the  plus  symbols.  The  inverse  of  the  slope  given  by  the 
curve  fit  is  the  kinetic  energy  decay  exponent,  rr~  1,38.  The  simulation  was  considered  to  be 
well  developed  at  a  time  of  0. 1 4s.  This  field  (at  t=0.14s)  was  used  in  all  the  subsequent 
simulations,  as  the  well  developed  low  Reynolds  number  (Re=2l 1)  initial  condition.  The  same 
box  averaging  procedure  as  the  high  Re  number  case  was  used  to  create  initial  conditions  for  the 
smaller  meshes. 
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Figure  l  L  Turbulent  time  seale  (kf/£t)  vs.  simulation  time  for  a  256x256x512  simulation  with 
increased  viscosity  and  small  amount  of  modeled  kinetic  energy  (DNS). 

Figure  12  shows  the  total  kinetic  energy  predictions  for  the  low  Re  test  case.  The  circles  represent 
the  essentially  DNS  simulation  (though  the  model  is  on),  the  lines  represent  the  various 
simulations  from  1x1x2  through  128x128x256.  These  are  in  fairly  good  agreement  with  the  DNS 
data,  as  was  observed  for  the  high  Reynolds  number  case.  The  slight  discrepancy  at  long  times  is 
most  likely  due  to  errors  in  the  R  A  NS  model  at  lower  Re. 


Tau 


Figure  12.  Total  kinetic  energy  predictions  of  isotropic  decay  at  initial  Re=2 1 1. 
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Figure  13.  Ratio  of  modeled  kinetic  energy  to  total  kinetic  energy,  Re=21 1 . 


The  ratio  of  modeled  kinetic  energy  is  plotted  in  Figure  13,  The  full  RANS  simulation  (1x1x2)  is 
at  the  top  of  the  figure  (ratio  of  1.0)  and  the  DNS  is  shown  at  the  bottom  of  the  figure  with  a  ratio 
of  roughly  IQ”3 .  Once  again  it  is  observed  that  there  is  no  tendency  for  these  solutions  to  move 
towards  the  RANS  solution  (ratio  of  1.0).  The  LES  solutions  stay  entirely  unsteady  and  three- 
dimensional,  and  the  correct  decrease  in  this  ratio  over  time  is  shown.  The  128x128x256  mesh 
is  close  to  DNS  and  the  256x256x512  mesh  is  clearly  a  DNS  simulation  even  though  the  model 
is  technically  on. 


4.5  Scaling 

In  classic  LES  models  the  lengthscale  is  assumed  to  be  proportional  to  the  mesh  size,  Ax ,  and  it 

1  A 

can  shown  that  the  gradients  scale  like  ut  f  ~  e*A  } .  The  eddy  viscosity  is  then  constructed  from 

I  4 

this  scaling  and  has  the  form,  vT  ^  u.  A1  ^  £-  A3,  The  use  of  the  mesh  size  to  infer  the 

lengthscale  is  why  classic  LES  fails  outside  of  the  inertial  range  (close  to  the  DNS  or  RANS 
regimes).  It  is  also  one  reason  why  the  proposed  self-adaptive  model  (which  predicts  the 
lengthscale  and  does  not  infer  it)  can  operate  at  any  mesh  size.  However,  the  current  model 
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should  still  obtain  the  classical  LES  scaling  in  the  inertial  subrange.  In  other  words,  the 
lengthscale  predicted  by  the  model  {Lm  =  kml£ )  should  be  proportional  to  the  mesh  size,  when 
the  model  is  operating  in  the  LES  regime. 

Figure  14  looks  at  predicted  lengthscale  (in  log  scale)  at  a  fixed  time  1=0.5  for  the  various 
meshes.  At  small  values  of  Ax  (large  number  of  mesh  points)  it  is  obvious  that  the  lengthscale  is 
indeed  a  linear  function  of  grid  spacing  as  would  be  expected  with  classic  LES.  A  reference  line 
has  been  added  (dotted  fine)  with  a  slope  of  L0  for  comparison.  If  a  well  defined  -5/3  slope  were 
to  be  observed  in  the  energy  spectra,  one  would  assume  that  classic  LES  would  closely  match  the 
reference  line  in  that  -5/3  region. 


Figure  14.  Lengthscale  behavior  at  time  =  0.5s. 


Because  the  spectra  at  this  moderate  Reynolds  number  (Figures  6  and  9)  show  only  a  weakly 
observable  -5/3  slope,  it  is  not  surprising  that  the  lengthscale  shown  in  figure  15  does  not  match 
the  reference  line  exactly.  One  would  also  expect  viscous  effects  (the  influence  of  a  viscous 
lengthscale)  to  reduce  the  scaling  below  linear  and  this  is  indeed  observed.  The  figure  clearly 
shows  how  the  turbulent  lengthscale  is  not  related  to  the  mesh  spacing  outside  the  LES  regime. 

Mesh  resolutions  smaller  than  64x64x128  do  not  exhibit  the  linear  behavior  because  now  the 
relevant  turbulent  lengthscale  is  no  longer  related  to  the  mesh  size.  There  is  a  transition  region 
that  one  might  call  VLES  (32x32x64)  but  not  full  LES,  The  8x8x16  mesh  might  be  considered 
the  start  of  URANS  (as  shown  in  Figure  14),  where  the  largest  scales  can  just  be  resolved,  and 
unsteady  3D  solutions  maintained.  These  smallest  mesh  sizes  are  where  the  solutions  can  not 
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sustain  a  cascade  because  all  the  largest  turbulence  has  scales  that  are  smaller  than  the  mesh  size. 
Interestingly,  Carali  cl  a!.3  conducted  LES  simulations  for  isotropic  decaying  turbulence  and 
determined  that  the  smallest  mesh  size  possible  for  a  classic  (dynamic  model)  LES  simulation 
was  48” .  This  is  in  very  good  agreement  with  Figure  14,  and  is  just  where  the  elbow  in  the  curve 
occurs. 


5  Parallel  Programming 

Even  with  current  technological  advances  in  computer  hardware,  large  simulations  can  require  an 
extraordinary  amount  of  computer  resources.  Performing  a  256x256x5 1 2  simulation  for  these  test 
cases  requires  the  computer  to  solve  approximately  34  million  grid  points.  It  is  obvious  that  a 
standard  PC  will  not  be  able  to  handle  the  computational  requirements  for  this  case  anytime  in  the 
near  future.  In  order  to  decrease  simulation  time  and  lower  the  memory  requirements,  the  code 
was  parallelized.  This  was  accomplished  by  utilizing  the  Messaging  Passing  Interface  (MPI) 
library.  MPI  is  a  library  of  functions  for  Fortran  and  C/C++  that  distributes  information  from  a 
single  processor  to  multiple  processors.  Using  more  processors  allows  the  larger  simulations  in 
this  work  to  be  accomplished, 

5J  Boundary  Conditions 

While  MPI  facilitates  the  transfer  of  data  between  processors,  care  must  be  taken  to  ensure  the 
correct  data  is  passed  between  neighboring  processors.  All  simulations  performed  in  this  work 
use  periodic  boundary  conditions  as  shown  on  the  left  side  of  Figure  15. 


FULL  DOMAIN 


MPI 


CHOS7  CELLS 


PARTLAi  DOMAIN 


Figure  15.  Single  processor  (left  side)  and  MPI  (right  side)  partitioning  of  the  problem. 


'This  figure  (left  side)  shows  how  data  is  stored  and  distributed  for  periodic  boundary  conditions 
on  a  single  processor.  All  of  the  interior  points  are  within  a  domain  of  Lx  and  Ly  (shaded  area) 
and  stored  on  a  single  processor.  The  ghost  cells  are  used  for  the  periodic  boundary  conditions 
and  are  given  by  the  values  of  the  interior  points  at  each  face  in  this  rectangle.  Here,  Nemo  the 
fish  is  seen  swimming  out  of  the  left  side  of  the  domain.  Because  of  periodicity,  he  is  wrapped 
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around  to  the  opposite  face  he  just  left,  and  is  seen  swimming  into  the  domain  from  the  right  hand 
side.  Similarly,  if  he  swims  out  of  any  other  face,  he  will  map  around  to  the  opposite  face  he  just 
left.  This  ensures  that  Nemo  continues  swimming  inside  the  periodic  domain. 


Suppose  that  the  same  data  set  is  desired  to  be  split  in  half,  then  each  processor  should  get  half  of 
the  interior  points  as  shown  (right  side  of  Figure  15).  Here  the  x -direction,  with  an  initial  domain 
of  (Lx)  has  been  split  in  half  (Lx/2)  for  two  processors.  It  is  now  apparent  that  care  must  be  taken 
when  determining  the  correct  transfer  of  data  between  these  two  processors.  The  only  way  to 
correctly  determine  this  information  is  by  allowing  every  processor  to  locate  its  neighbor  for  each 
of  the  (cubic)  domain's  six  faces.  From  Figure  15  for  the  partial  domain,  interior  information 
must  be  shared  with  neighboring  processors  (given  by  the  smiley  face).  However,  at  domain 
edges,  the  boundary  information  should  wrap  around  to  the  opposing  face,  (given  by  Nemo).  An 
algorithm  was  developed  to  do  exactly  this  and  is  explained  in  detail  in  the  next  section, 

5,2  Neighboring  CPU  Data 

The  algorithm  used  to  compute  neighboring  CPU  information  is  detailed  in  Appendix  B.  Figure 
16  shows  a  parallel  simulation  consisting  of  eight  processors.  The  left  side  of  Figure  16  shows  the 
data  together,  the  right  side  shows  the  front  and  back  plane  separated  for  clarity  with  each  of  the  8 
processors  numbered  from  0-7.  Since  each  cube  of  data  has  six  sides  it  is  clear  that  there  will  be 
six  neighboring  CPU  locations  calculated  for  each  processor  This  algorithm  uses  simple  integer 
division  to  allow  each  processor  to  find  its  correct  neighbors  for  an  arbitrary  partition.. 


PARTITION 


2 

3 
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Figure  16,  Neighboring  CPU  location. 


By  visual  inspection  the  processor  to  the  right  of  0  is  processor  L  There  is  no  processor  to  the  left 
of  0,  however  periodicity  tells  us  that  the  information  should  wrap  around  from  processor  l .  The 
processor  above  0  is  processor  2,  and  below  is  also  2  (from  periodicity).  Finally,  the  processor  jn 
front  of  0  is  4  (periodicity)  and  the  processor  behind  0  is  also  processor  4, 
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This  simple  reasoning  is  exactly  what  the  algorithm  computes  in  Appendix  B.  For  the  processor 
in  question,  the  MPJ  assigned  integer  'myid'  is  0,  Because  there  are  two  processors  in  every 
direction,  NXP  =  NYP  =  NZP  —  2  .  The  values  of  ix,  iy,  and  \z  are  calculated  as  I .  The  CPU  to 
the  right  of  processor  0  will  be  calculated  as  (CPUinfol  -  l  ),  the  CPU  to  the  left  (CPUinfo2  -  I ), 
the  CPU  at  the  top  and  bottom  are  given  by  (CPUinfo3  =  2,  CPUinfo4  =  2)  respectively.  Finally, 
the  CPU  located  at  the  back  and  front  of  the  CPU  in  question  are  (CPUinfoS  =  4,  CPUinfo6  =  4) 
respectively.  Simple  integer  division  based  on  the  variable  'myid'  computes  neighboring 
processor  locations.  The  TIf  statements  shown  in  Appendix  B  are  used  whenever  a  processor  is  on 
the  edge  of  a  domain  (such  as  computing  the  processor  to  the  left  of  0).  If  any  of  these  processors 
are  on  this  domain  edge,  it  is  computed  that  boundary  information  will  wrap  around  to  ensure 
periodicity.  It  is  clear  from  the  above  figure  that  the  algorithm  has  correctly  determined  the 
location  of  the  neighboring  CPUs  relative  to  processor  0. 

5,3  Transfer  of  Boundary  Condition  Information 

With  everything  in  place,  the  actual  passing  of  the  boundary  information  will  now  be  explained  in 
detail.  The  cell  indexing  convention  is  shown  in  Figure  17.  This  shows  that  the  interior  points  are 
given  by  the  data  values  from  1-NX  and  1-NY,  where  NX  is  the  number  of  points  in  the  x- 
direction  and  NY  is  the  number  of  points  in  the  y-direction.  The  ghost  cells  (dashed  lines)  used 
for  the  boundary  conditions  are  placed  at  index  locations  around  the  interior  values.  The  location 
of  cell  1  (0,0),  cell  2  (NX+1,0),  cell  3  (NX+1.NY+I),  and  cell  4  (0.NY+I)  are  the  four  comers  for 
this  2d  example.  The  correct  procedure  must  be  followed  in  order  for  these  four  comer  cells  to 
contain  the  correct  data.  The  z-di  recti  on  (out  of  the  page)  follows  the  same  logic,  but  for  clarity  is 
not  shown. 


CELL  4  CELL  1 

X - X  NY+ 1 


0  I  NX  N X  + 


CELL  INDEXING 


Figure  17.  Cell  indexing  convention. 
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In  order  to  enforce  periodic  boundary  conditions  on  this  cell,  there  are  four  update  steps  for  this 
2d  example.  These  update  steps  are  shown  in  Figure  18.  Step  one  updates  the  ghost  cells  with 
values  from  the  right  side  interiors  (NX,1-NY)  as  shown.  Because  interior  values  are  used,  these 
are  correct.  The  second  step  lakes  the  top  interior  values  (l -NX, NY)  along  with  ghost  cell  values 
at  (0,NY)  and  (NX+1,NY).  When  transferred  to  the  bottom  ghost  cell  plane,  cell  2  now  has 
incorrect  information  as  shown.  To  correct  this  cell,  the  third  step  is  performed.  This  step  lakes 
the  left  interior  values  (1,1-NY)  along  with  ghost  cell  values  at  (1,0)  and  (l.NY+l),  and  transfers 
these  to  the  ghost  cells  on  the  right  side.  When  step  three  has  completed,  the  incorrect  cell  2  value 
has  been  replaced  with  a  correct  value,  but  now  cell  3  contains  bad  information  as  shown. 
Finally,  the  fourth  step  takes  the  bottom  plane  (O-NX+1,1)  and  transfers  these  corrected  values  to 
the  top  ghost  cells  at  NY+1.  Now  cell  3  has  been  replaced  with  correct  data  and  all  ghost  cell 
information  is  correct.  This  exact  procedure  must  be  followed  in  this  precise  order  to  ensure  that 
all  ghost  cells  (especially  the  comers)  are  indeed  updated  with  the  correct  information.  For  three 
dimensions  (z-direction)  would  also  have  two  additional  update  steps  (after  x  and  y),  following 
the  same  logic. 
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Figure  18.  Ghost  cell  update  procedure. 
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The  Fortran  code  used  for  the  the  procedure  in  Figure  18  is  given  in  Appendix  C  For  brevity 
only  the  x-direction  will  he  explained  in  great  detail,  it  really  is  trivial  to  understand  and  apply  the 
same  information  for  the  y  and  z  directions.  The  values  of  NX,  NY,  NZ  are  given  by  the  number 
of  interior  points  in  the  domain  for  each  direction.  The  variable  (xx)  is  a  3D  array  of  data  with 
correct  interior  domain  information,  but  with  incorrect  boundary  information  stored  at  the  ghost 
cells.  The  routines  SendBCX  and  RccvBCX,  are  given  in  Appendix  C,  These  are  the 
procedures  that  update  the  ghost  cell  information. 


Subroutine  pPeriodic _BC (xx) 


Call 

Send_ 

BCX ( NX , NY , NZ , xx ) 

iSend 

B/C 

info 

for 

X 

dir 

Call 

Recv 

BCX (NX, NY, NZ, xx) 

I  Recv 

B/C 

info 

for 

X 

dir 

Call 

Send^ 

BCY (NX , NY , NZ , xx) 

!  Send 

B/C 

info 

for 

Y 

dir 

Call 

Recv_ 

BCY ! NX , NY , NZ , xx ) 

I  Recv 

B/C 

info 

for 

Y 

dir 

Call 

Send 

BCZ ( NX , NY , NZ , xx ) 

ISend 

B/C 

info 

for 

2 

dir 

Call 

Recv 

_BCZ  (NX  ,  NY ,  NZ ,  xx) 

l  Recv 

B/C 

info 

for 

z 

dir 

End  Subroutine  pPeriodic  BC 


The  routine  above  performs  ghost  cell  updates  1  and  2  with  Send/Recv  BCX.  Ghost  cell  updates 
3  and  4  are  performed  with  Send/Recv  BCY,  The  send  operation  in  Send  BCX  uses  the  MPi 
function  call  MPI  IS  END  and  MPIIRECV  (commonly  used  MPI  functions  are  shown  in 
Appendix  D),  There  are  two  basic  ways  to  transfer  information  with  MPI,  these  are  blocking  and 
non-blocking  operations.  A  blocking  operation  (such  as  MPI  SEND)  will  perform  the  function 
call  requested  and  also  halts  all  processors  from  continuing  any  further  until  the  requested  process 
is  completed,  A  non -blocking  operation  (such  as  a  MPMSEND)  will  send  the  required  data  to  all 
processors,  but  will  not  halt  any  processors  from  doing  useful  work  while  the  send  operation  is  in 
progress.  This  (will  be  shown  shortly)  can  affect  the  overall  performance  of  the  code  and  care 
should  be  taken  to  decide  which  operation  to  perform.  It  was  decided  that  the  boundary 
conditions  will  use  non-blocking  send/receives  and  use  MPIWA1TALL  to  guarantee  this  data  is 
received  in  the  appropriate  order  of  Figure  18.  The  MPI  WAITALL  command  (Appendix  D) 
allows  the  programmer  to  decide  when  a  manual  block  should  be  placed  m  the  code.  The 
structure  of  the  pPeriodic  BC  routine  uses  MPMSEND  and  MPMRECV  to  send  and  receive 
the  correct  ghost  cell  boundary  information.  Because  these  are  non-blocking,  all  of  this  data  is 
sent  at  once  to  all  processors.  The  routine  Recv  BCX  is  just  a  MPI  WA1TALL  that  will  not  let 
any  processor  continue  until  all  of  the  send/receives  for  the  x -direct ion  are  finished.  Because  non- 
blocking  operations  are  performed,  if  the  WAtTALL  incurs  any  performance  penalty  at  all,  it  will 
be  much  less  than  performing  a  blocking  send  and  a  blocking  receive.  This  same  methodology  is 
applied  lo  the  y  and  z  directions.  This  strictly  guarantees  that  the  x-direction  information  is  sent 
and  received  first,  y-direction  follows  second,  and  the  z  direction  is  last. 
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The  reader  is  now  referred  to  the  ScndBCX  subroutine  in  Appendix  C.  The  variables  sent  into 
this  routine  are  the  NX,NY,NZ,  and  the  3D  array  that  is  to  be  updated  with  periodic  boundary 
conditions.  A  send  buffer  is  created,  (sbuffxr)  that  contains  the  values  of  the  y-z  plane  at  fixed 
interior  x-Iocation  (beginning  (l)  or  ending  (NX)),  To  take  out  any  ambiguity,  the  send  buffer  (x- 
direction)  left  plane  and  send  buffer  (x-direction)  right  plane  are  clearly  shown  in  Figure  19, 
along  with  orientation  axis. 


y 

Jl  TOP 

BACK 


Figure  19.  Orientation  and  planar  data  values  used  by  periodic  boundary  conditions. 


As  is  shown  in  Recv  BCX  routine  these  x-direction  planar  values  are  now  (once  the  operation  is 
complete)  used  to  update  the  ghost  cell  information  for  the  input  array  that  is  desired  to  be 
periodic.  For  clarity  these  values  arc  given  by  the  receive  buffer  (x-direction)  right  side  (rbuffx  r) 
and  receive  buffer  (x-direction)  left  side  (rbuffxl),  With  the  process  explained  in  detail,  it  should 
be  apparent  that  the  y  and  z  directions  follow  the  exact  same  procedure  outlined  above,  with  only 
slight  changes  in  orientation  and  by  the  naming  convention  of  the  send  and  receive  buffers.  For 
example  the  send  buffer  (y-direction)  top  x-z  planar  data  is  stored  as  sbuiTy  _l,  and  the  sent 
bottom  planar  data  is  stored  as  sbuffy  b,  etc. 
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5.4  MPI  I/O 


The  CPU  order  shown  in  Figure  16  is  determined  by  the  input  routine  shown  below. 


NXP  =  DNX /NX  ! 

# 

processors 

in 

x-dir 

NYP  =  DNY/ NY  ! 

# 

processors 

in 

y-dir 

NZP  =  DNZ/NZ  ! 

# 

processors 

in 

z-dir 

!  CPU  location  indices 

kl  =  myid/ (NXP*NYP)  +  1 
kr  =  mod (myid, NYP*NXP) 
jl  =  kr/NXP  +  1 
il  =  mod(kr,NXP)  +  1 


!  read  in  the  total  field  and  store  in  a  temporary  array 
!  called  uJ. 

open  (unit  =  503 , f ile=trim (str4 ) , form= 1  formatted  1 ) 

read (503, *)  { ( (uJ (i , j , k)  , i  =  0, DNX  +  1) , j  =0, DNY+1 ) , k=0, DNZ  +  1) 

!!  now,  let  every  processor  get  it’s  own  "chunk"  of  data, 
tke ( 0 :NX+1 , 0 : NY+1 , 0 :NZ+1 )  = 

uJ (NX* ( i 1 - 1 )  :  ( il*NX) + 1 , NY  * ( jl-1)  :  ( j 1 *NY) +1,NZ* (kl-1 )  :  (kl*NZ) +1 

> 

close (unit=503)  ! close  the  large  field  uJ. 


The  input  routine  is  shown  for  the  kinetic  energy  (tke).  All  other  data  Helds  (u,  v,  w,  eps)  are  read 
in  exactly  the  same  way.  In  Fortran  the  expression  mod(n.m)  gives  the  remainder  when  n  is 
divided  by  m;  it  is  applied  to  integers.  Examples  are  mod(8,3)  =  2,  mod(27,4)  =  3,  mod(l  1,2)  =  I, 
mod(20,5)  =  0.  The  number  of  processors  in  each  direction  are  determined  by  dividing  the  total 
domain  size  (DNX,  DNY,  DNZ)  into  smaller  problem  sizes  (NX,  NY,  NZ).  Then  integer  division 
and  the  mod(n,m)  expression  are  used  to  compute  three  integer  variables  (il,  jl,  kl).  The  problem 
size  plus  boundaries  is  then  defined  for  every  processor  and  the  code  uses  (il,  jl,  kl)  to  correctly 
grab  a  'chunk'  of  data  from  the  total  domain.  Finally  the  large  (total)  array  uJ  is  closed. 
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Refer  to  (lie  domain  decomposition  show  in  Figure  20,  where  a  domain  is  split  in  half  (x- 
direction).  Assume  (for  this  example)  that  we  wish  to  run  a  64x64x128  simulation  on  two 
processors  as  shown.  The  total  domain  is  given  as  (DNX=64,  DNY=64,  DNZ=128),  the  problem 
size  for  each  processor  would  be  (NX=32,  NY  =64,  NZ=128). 
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Figure  20,  Single  processor  (left  side)  and  MPI  (right  side)  partitioning  of  the  64x64 x  1 28 

example  problem. 

MPI  will  randomly  assign  the  variable  ’my id1  to  each  processor  it  uses.  In  this  case,  two 
processors  would  be  available  with  ’mykHO1  and  *myid=F.  To  ensure  that  the  correct  processor  is 
located  exactly  as  shown  in  Figure  20,  the  input  routine  above  is  used.  This  routine,  reads  in  the 
total  energy  array  (with  boundaries)  and  stores  this  in  a  temporary  array  (uJ(G;65, 0:65,0: 129))  on 
each  CPU.  The  processor  with  'myid  =  O',  computes  (il=l,  jl=l,  kl“l).  This  stores  its  'chunk1  of 
kinetic  energy  (tke)  with  information  from  the  total  domain  uj{0:33,  0:65,  0:129),  exactly  as 
shown  in  Figure  20.  Likewise,  the  processor  'myid  =  \\  computes  (il=2t  jl=l,  kl~l),  which  stores 
it’s  (tke)  from  the  values  ui(32:65,  0:65,  0:129).  Finally,  now  that  the  large  temporary  array  (uJ) 
has  been  partitioned  correctly,  it  is  closed.  It  is  clear  from  this  example,  that  these  arc  indeed  the 
correct  indices  for  the  input  arrays.  This  guarantees  that  the  input  data  will  indeed  follow  the 
numbering  sequence  shown  in  Figure  18,  with  the  appropriate  interior  domain  and  boundary 
(ghost  cel!)  information. 

5,5  Data  Output 

The  output  routine  (Appendix  F)  follows  exactly  the  same  logic  as  the  input  routine,  except  in 
reverse  order.  Although  this  routine  is  shown  for  kinetic  energy  (tke),  the  other  fields  (u,  v,  w, 
eps)  follow  the  same  procedure.  Typically  a  root  processor  is  chosen,  usually  'myid=0\  is 
selected.  This  root  processor  will  be  the  processor  that  collects  all  data  from  other  CPUs  using  a 
(MPI  RECV)  and  pieces  together  the  total  domain  by  using  the  values  of  il,  jl  ,kl.  A  simple  If 
statement  allows  all  other  processors  (not  root)  to  send  (MPI  Send)  their  data  along  with  the 
corresponding  values  (myid,  il,  jl,  kl).  Once  the  root  processor  has  received  all  of  the  data  from 
each  processor  and  pieced  together  the  total  domain,  it  then  writes  out  data  for  postprocessing 
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use.  The  output  routine  shown  does  just  this,  and  is  clearly  explained  in  Appendix  F  by  the 
programmer  comments. 

5.6  Global  Calculations 

This  last  section  deals  with  global  variables  used  in  the  calculations  of  the  code.  Namely,  these 
are  the  SUM,  MINVAL,  and  MAXVAL  Fortran  calls.  With  the  total  domain  now  decomposed 
and  sent  to  'N1  number  of  processors  there  must  be  a  way  to  figure  out  the  total  SUM  of  data 
values  globally  (not  just  on  each  processor).  Similarly  the  MINVAL/MAXVAL  functions  must 
give  values  for  the  total  domain  and  not  for  each  processor.  In  order  to  accomplish  this,  three 
extra  routines  were  used  in  conjunction  with  the  MPIALLREDUCE  function  (Appendix  D). 
These  three  routines  arc  shown  in  Appendix  E,  It  should  be  straight  forward  to  see  that  the 
ALL  REDUCE  function  simply  combines  the  values  from  all  processors  to  calculate  a  desired 
global  value.  T  his  global  value  is  then  distributed  to  all  processors.  For  example,  if  an  array  is 
given  by  {1.0  2.0  3.0  4.0),  and  each  value  is  sent  to  a  parallel  job  with  four  processors,  each 
processor  stores  the  entries  1.0  -  4.0.  If  a  Fortran  SUM  is  performed,  processor  1  calculates  the 
total  sum  of  the  given  array  as  1.0,  processor  2  calculates  the  total  sum  to  be  2.0,  etc.  However 
with  MPI  ALLREDUCE,  the  actual  global  sum  of  {1.0  +  2.0  +  3.0  +  4.0=10.0}  is  calculated 
and  this  global  value  is  then  sent  to  every  processor 

5*7  Parallel  Optimization 

With  all  the  pieces  mentioned  above  in  place,  the  code  was  fully  parallelized.  Extensive  testing 
and  debugging  proved  that  the  parallel  version  of  the  code  gives  (almost  to  machine  precision) 
the  same  solution  as  the  serial  version.  The  next  and  last  step  was  to  optimize  the  code  in  order  to 
fully  take  advantage  of  using  multiple  processors.  Optimization  in  parallel  consists  of  efficiently 
sending,  receiving,  and  computing  data. 

Start^MPI 
l Non- Optimized 

Compute  Sqrt  (Data  B)  [computes  square  root  of  Data 

MPISEND {Data  A)  ! (blocking)  send  to  all  processors 
(takes  a  while) 

[processors  sit  idle  at  this  point  until  the  send  is 
completed* 

Answer  =  (Data  A  and  Sqrt (Data  B)  ) 

[Optimized 

MPIISBND (Data  A)  ! non-blocking  send  to  all 

processors  (takes  a  while) 

Compute  Sqrt  (Data  B)  [with  send  in  progress, 

computer  continues  to  work 

[once  the  Sqrt  is  calculated,  the  send  has  completed 

MPIWAITALL  [manual  stopping  point  to  make 

sure  Data  A  was  sent 

Answer  =  (Data  A  and  Sqrt (Data  B) ) 

EndMPI 

A  simple  example  is  shown  obove,  where  a  notvoptimized  and  an  optimized  solution  arc 
computed  with  two  arrays  (Data  A,  Data  B).  The  non-oplimized  code  computes  the  square  root  of 
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(Data  B)  and  then  sends  (Data  A)  to  all  processors.  At  this  point,  the  answer  cannot  be  computed 
until  this  sending  operation  has  been  completed.  Recall  that  MPI  SEND  is  a  blocking  function 
call,  and  will  halt  all  processors  until  the  sending  operation  is  completed.  If  (Data  A)  is  very 
large,  this  send  operation  can  take  a  long  time,  and  at  this  point  all  processors  sit  idle  and  wait. 
When  the  send  is  completed  the  solution  (Answer)  is  calculated.  Obviously  this  will  yield  a 
correct  answer,  but  it  is  not  the  best  use  of  available  resources  because  now  art  additional  penalty 
is  incurred  by  the  communication  time  between  processors, 

The  optimized  code  however,  will  yield  the  correct  answer,  and  will  do  so  faster!  This  is  because 
the  MP1ISEND  is  performed  (as  shown).  Recall  that  MPI  ISEND  is  a  non-blocking  operation. 
While  this  information  is  being  sent,  there  is  no  reason  why  the  processors  cannot  continue  to  do 
some  useful  work.  As  (Data  A)  is  sent  (optimized  code),  all  processors  begin  computing  the 
square  root  of  (Data  B).  When  the  square  root  has  been  calculated,  most  (or  ideally  all)  of  (Data 
A)  has  been  sent.  If  all  of  (Data  A)  was  sent  during  the  computation  on  (Data  B),  the  WAIT  ALL 
function  incurs  no  communication  penalty.  Even  if  all  of  (Data  A)  was  not  sent,  the  WAIT  ALL 
still  takes  less  time  than  the  blocking  send  performed  by  the  non-optimized  code.  The  final 
answer  will  be  computed  much  faster.  This  hides  the  MP1  communication  times  behind  actual 
computational  work  in  order  to  develop  an  ideally  optimized  code.  By  letting  each  processor  do 
as  much  work  as  possible  during  these  MP1  communication  function  calls,  better  parallel 
performance  is  obviously  obtained.  This  idea  was  mentioned  above  and  shown  to  be  effective  for 
the  periodic  boundary  condition  routine  shown  in  Section  4. 1 2, 

Figure  21  shows  the  total  simulation  time  for  a  128x128x256  and  a  64x64x128  simulation  before 
and  after  MPI  optimization.  This  figure  shows  that  onee  parallelized,  the  total  simulation  time  for 
a  128x128x256  simulation  was  5.2  hours  and  0.71  hours  for  the  64x64x128  case,  using  eight 
processors.  After  the  code  was  optimized  by  hiding  as  much  of  the  communication  time  as 
possible,  the  total  simulation  time  for  both  cases  (again  with  8  processors)  dropped  by  roughly  20 
percent.  This  reduction  with  the  optimized  code  saves  valuable  supercomputing  time. 
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Figure  21.  Total  simulation  time  (hrs)  for  64x64x1 28  and  128x1 28x256  before  and  after 

optimization  on  the  UMass  cluster. 


5*8  MPI  Performance 
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Figure  22  shows  the  parallel  performance  (log  scale)  as  recorded  on  the  UMass  cluster  named 
'von  Kaiman'.  This  figure  shows  the  total  simulation  time  (min)  versus  the  number  of  CPUs  used. 
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Figure  22.  Total  simulation  time  (min)  on  the  UMass  cluster  vs.  number  of  CPUs. 


It  is  apparent  that  for  the  32x32x64  simulation  (squares),  that  using  additional  processors  with 
von  Kannan  increases  the  time  to  run  a  simulation.  This  is  because  the  communication  time 
between  processors  is  greater  than  actual  processor  work.  Luckily,  the  64x64x128  (diamonds) 
and  128x128x256  (x*s)  have  enough  work  on  each  processor  so  that  the  communication  time  is 
better  hidden  and  is  much  lower.  Here  we  see  that  using  additional  processors  immediately 
decreases  the  simulation  time.  Because  the  UMass  cluster  was  used  extensively  for  testing  and 
debugging,  a  plot  of  MP1  performance  such  as  Figure  22  was  needed.  It  should  be  noted  that  von 
Kannan  is  an  old  machine  with  a  slow  interconnect  between  processors.  This  is  especially 
noticeable  at  the  32x32x64  mesh.  A  more  accurate  comparison  of  parallel  performance  is  shown 
in  Figure  23. 
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Figure  23.  Speed-up  factor  vs.  number  of  processors  (UMass  and  ARSC). 


This  figure  shows  the  parallel  performance  of  the  developed  code  as  recorded  on  the  Arctic 
Region  Supercomputing  center  IBM/P69Q  named  'Iceberg'  (symbols  with  solid  lines)  compared 
against  'von  Karman '  (symbols  with  dashed  lines).  The  speed-up  factor  is  given  by  the  total 
simulation  time  on  a  single  processor  divided  by  the  simulation  time  for  multiple  processors.  It  is 
not  necessarily  a  measure  of  how  fast  a  given  CPU  is,  but  rather,  a  measure  of  the  communication 
time  between  processors  for  any  given  architecture.  The  ideal  performance  is  also  shown  (solid 
line  -  circles)  and  shows  a  linear  relationship.  For  example,  using  twice  as  many  processors, 
should  give  (ideally)  a  speed-up  factor  of  two,  etc.  Because  von  Karman  has  a  slow  interconnect, 
it  is  not  surprising  to  see  that  it  gives  the  worst  performance  (bottom  three  curves).  Von  Karman 
does  not  even  begin  to  follow  the  ideal  speed-up  curve  until  the  128x128x256  mesh  size.  Even  at 
this  large  mesh,  the  slow  communication  time  between  processors  only  gives  a  speed-up  factor  of 
4.7  for  eight  processors.  More  interesting  however,  is  how  well  the  DoD  funded  supercomputer 
performs  with  the  developed  code.  It  has  much  faster  communication  between  processors. 

It  is  clear  from  Figure  23  increasing  the  number  of  processors  on  Iceberg  immediately  results  in  a 
faster  simulation  time.  This  is  shown  by  (solid  lines  -  symbols)  the  close  behavior  of  the 
simulations  on  Iceberg  to  the  ideal  speed-up.  Even  with  the  small  32x32x64  simulation,  Iceberg's 
communication  time  is  so  fast,  that  using  additional  processors  immediately  decreases  simulation 
time  (unlike  von  Karman).  At  the  largest  resolution  of  128x128x256  (for  eight  processors)  the 
speed-up  on  Iceberg  is  6,8  compared  to  von  Karman  with  4.7,  and  the  ideal  of  8.0.  As  was 
expected.  Iceberg  outperforms  von  Karman.  It  should  be  mentioned  that  the  increased  speed-up 
factor  for  the  64x64x128  simulation  on  Iceberg  has  nothing  to  do  with  MPI  or  optimization.  This 
has  to  do  with  the  particular  architecture  the  code  is  running  on.  In  this  case,  Iceberg  appears  to 
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handle  the  64x64x128  extremely  well  and  gives  better  than  ideal  speed-up  factors.  With  the 
available  information  on  parallel  performance,  all  of  the  large  simulations  32x32x64  - 
256x256x512  were  performed  with  multiple  processors  at  ARSC. 


6  Reynolds  Stress  Transport  Models 

We  wish  to  demonstrate  that  the  ability  to  model  turbulence  at  all  mesh  scales  (universal 
modeling)  is  not  a  particular  property  of  the  modified  (backscatlering)  k/e  model  presented  in 
section  4.  It  is  a  general  approach  that  can  work  for  many  different  RANS  models.  In  this 
section  we  will  show  the  behavior  of  a  Reynolds  stress  transport  (RST)  model  when  it  is  used  on 
a  fine  mesh  as  an  LES  subgrid  scale  model. 

RST  arc  the  next  level  in  complexity  for  turbulence  modeling  beyond  two-equation  models. 
They  are  particularly  interesting  in  the  context  of  this  work  because  they  can  backscatter  energy 
without  any  explicit  modification.  In  particular,  a  modification  such  as  a  in  the  k/c  model  is  not 
necessary. 

RST  models  contain  more  physics  than  two-equation  models  so  they  can  lead  to  better 
predictions  in  the  coarse  mesh  limit,  and  might  even  lead  to  better  predictions  at  the  LES  mesh 
level.  The  exact  Reynolds  stress  transport  equation  is, 


Dt 


=a+y+  a  -  fs + 

curt  nmt  iTtndrW  m  ntfcted  irwx/fW 


(15) 


This  equation  states  that:  the  material  derivative  of  Rt/  equals  the  rate  of  production  (/J),  plus 
transport  by  diffusion  (D*  +  £>rj) ,  minus  the  rate  of  dissipation  (£v )  t  plus  redistribution  due  to 
turbulent  pressure-strain  interaction  (Lip,  plus  redistribution  due  to  rotation  Qt/.  The  known 
quantities  arc  labeled  'exact'  and  the  unknown  quantities  require  models  to  solve  this  system. 

Major  advantages  with  this  approach  are  that  the  production  term  (usually  a  dominant  term)  is 
now  exact,  along  with  any  rotational  effects.  Because  transport  equations  are  solved  for  the 
evolution  of  Ryj  the  eddy  viscosity  hypothesis  is  not  needed.  Because  an  eddy  viscosity 

assumption  is  not  used,  backscatter  is  possible.  Finally,  since  most  of  the  computational 
expense  for  this  finite  volume  code  is  spent  on  solving  a  Poisson  equation  for  pressure,  additional 
transport  equations  may  improve  the  accuracy  of  the  model  significantly,  with  only  a  minimal 
increase  in  computational  requirements. 
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6.1  Universal  RST  Model 


The  modeled  transport  equation  for  the  Reynolds  stress  tensor  is, 

dH. 


dRm*  3(W  Rmit)  3 

- +  — i - =  - — 

3/  dx.  dx. 


(t/  +  ^r)- 


a*. 


+  Pmn  —  P  Rim  + 


Cp2s(Sm/  R„  +  Sv  RM  +  )  +  Cp2w  ( Wm)RiK  +  IV,  /?„„) 


+CpKW-gmm-R~)  +  Cp2‘Sm,  +  2vT 


1  dk  3fl„ 

v k  J 


(16) 


This  equation  predicts  the  evolution  of  Rit  where  =  —  is  the  dimensionless  stress  tensor. 

k 

The  model  constants  are  given  by,  Cpl=0.9,  Cp2s  =  0.8 ,  Q?2,+  =0.6,  Tlic  value  of 
Cpl*  ^ -0.2F1 ,  where  F  =  determinant  ] .  The  resolved  strain  rate  tensor  is  defined  as 
3nc^  t*ie  resolved  vorticity  tensor  is  given  by  Wfi  -  ,  -uf  i) ,  The 

production  term  is  given  as  Pit  =  -  ( RdUJk  +  R >* uik ) .  The  value  of  P  =  ^Pa ,  and  the  turbulent 

viscosity  is  defined  as  vT  =  — .  Note  that  the  turbulent  viscosity  is  only  used  in  the  diffusion 

terms.  It  is  not  used  to  determine  the  stresses. 


Additionally,  the  transport  equations  for  ut ,  k  and  e  are  again  needed  for  closure  and  are  shown 
below. 


dt 

a*+A< 

dl  d-t 


dx. 
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dx. 
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—  +  —  (£«,)  =  —[( - 
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07) 

dk  ,  - 
—]  +  P-£ 

dx, 

(18) 

|l)+£[Crl?-Cfl£] 

(19) 

Once  again,  overbars  on  ihe  velocity  and  pressure  have  been  dropped  for  convenience.  The 
production  term  (P)  is  defined  above.  The  constants  are  fairly  standard  kl£  constants, 
C,(  =1.55,  <jt  =  1.2,  at  =  1.0,  Cj,  =  0.15.  The  parameter  is  determined  from  the  analysis 

of  Perot  and  de  Bmyn  Kops14. 

6.2  Results 

A  number  of  simulations  were  performed  with  the  RST  model  described  above  using  the  same 
isotropic  decaying  turbulence  (Re-640)  initial  conditions  used  for  the  two-equation  universal  k/£ 
model.  Simulations  were  performed  for  32x32x64  -  128x128x256  mesh  sizes  and  total  kinetic 
energy  predictions  arc  shown  in  Figure  24,  Once  again  the  DNS  data  is  shown  by  the  targe 
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circles.  These  VLES  and  LES  simulations  match  the  DNS  data  well.  Small  improvements 
might  be  possible  by  tuning  some  of  the  RST  model  constants. 


Time(s) 


Figure  24.  RST  model  total  kinetic  energy  predictions  for  (initial  Rc=640)  isotropic  decaying 

turbulence. 


Two  of  the  kinetic  energy  ratios  arc  shown  in  figure  25.  These  two  mesh  resolutions  are  the  most 
difficult  to  capture  since  they  are  where  the  VLES  behavior  transitions  to  URANS.  As  with  the 
previous  k/e  model,  the  RST  model  maintains  a  relatively  constant  kinetic  energy  ratio,  that  tends 
to  drift  downwards  after  long  times.  This  means  that  while  the  different  meshes  give  the  same 
total  behavior  (figure  24),  they  have  different  underlying  solutions.  The  finer  mesh  cases 
automatically  resolve  more  of  the  turbulence  using  first  principals  and  model  less  of  the 
turbulence.  Unlike  the  k/e  version  of  this  approach  (that  required  a  modification  to  enable 
energy  backscatter)  this  RST  model  is  identical  to  its  RANS  counterpart.  It  is  simply  being 
applied  to  LES  meshes,  and  is  shown  to  automatically  perform  as  an  LES  subgrid  scale  model. 
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Figure  25.  RST  model  ratio  of  modeled  kinetic  energy  to  total  kinetic  energy,  Re -640, 


7  Conclusions 

The  primary  goal  of  this  project  was  to  demonstrate  that  RANS  models  can  be  used  to  model 
turbulence  at  any  mesh  resolution.  This  demonstration  has  been  successfully  performed  using 
two  very  different  types  of  RANS  models.  The  key  requirement  of  these  models  is  that  they  must 
be  able  to  backscatter  energy.  So  some  classical  models  which  enforce  a  unidirectional  energy 
transfer  (such  as  simple  eddy  viscosity  models)  cannot  automatically  adapt  to  the  available  mesh 
resolution.  The  effectiveness  of  the  RST  model  suggests  that  many  algebraic  Reynolds  stress 
models  and  nonlinear  k/e  models  (that  theoretically  have  the  ability  to  backscatter)  probably 
would  perform  like  RST  model  and  not  require  any  backscatter  modification. 

We  have  demonstrated  that  this  modeling  approach  automatically  computes  as  much  of  the 
turbulence  as  the  mesh  allows,  and  models  the  rest.  Different  mesh  resolutions  produce  the  same 
total  behavior,  but  produce  very  different  underlying  solutions.  The  finer  the  mesh,  the  more 
three-dimensional  and  unsteady  the  solution  stays.  Despite  the  fact  that  the  equations  are 
identical,  the  nonlinearity  of  the  equation  system  means  that  solutions  are  sensitive  to  the  mesh 
resolution.  The  method  is  robust.  Starting  with  very  different  initial  conditions,  results  in 
essentially  the  same  solution  behavior  after  long  times.  The  length  of  time  to  adjust  from 
arbitrary  initial  conditions  appears  to  be  proportional  to  the  modeled  turbulence  eddy  turn  over 
time. 
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It  was  also  demonstrated  that  this  approach  reproduces  the  classical  LES  length  scale  assumption 
(eddy  length  proportional  to  the  mesh  size)  in  the  limit  that  that  assumption  is  valid  (in  the  inertial 
range).  Tire  advantage  of  the  demonstrated  approach  is  that  it  is  valid  much  more  broadly  -  both 
in  the  RANS  and  DNS  limits,  and  in  complex  flows  where  an  inertial  range  may  well  not  exist. 

The  practical  utility  of  a  turbulence  modeling  approach  that  can  be  applied  to  any  mesh  cannot  be 
overstated.  This  model  is  not  dictated  by  the  proximity  of  walls  (like  DES).  It  is  not  controlled 
by  the  user  (like  PANS),  since  it  is  rarely  known  to  the  user  beforehand  what  kinetic  energy  ratio 
would  be  correct  for  a  particular  mesh.  It  does  not  blend  an  LES  and  RANS  model  together  in 
an  ad  hoc  way,  like  the  other  hybrid  models.  This  approach  is,  in  many  ways,  much  simpler. 
We  have  demonstrated  that  appropriately  chosen  RANS  models  can  function  well  at  any  mesh 
resolution.  These  models  lead  to  a  universal  modeling  approach  that  can  be  applied  at  any  mesh 
resolution.  The  user  specifies  what  they  can  afford,  and  the  model  provides  the  best  solution  it 
can  with  what  the  user  can  afford. 

This  approach  has  the  potential  to  make  LES  far  more  accessible.  Existing  commercial  and 
industrial  CFD  codes  with  RANS  models  can  very  easily  do  LES  now.  A  new  code  and  model  is 
not  necessary.  New  expertise  is  not  necessary. 

Finally,  this  project  has  broken  the  artificial  wall  that  had  been  built  between  the  LES  and  RANS 
modeling  approaches.  Despite  the  differences  in  terminology,  and  the  insistence  of  various 
model  developers  to  the  contrary,  this  work  unambiguously  shows  there  is  no  fundamental 
distinction  between  the  RANS  and  LES  modeling  approaches. 
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9  Appendices 

A.  Derivation  of  the  Evolution  Equations 

The  incompressible  Navier-Stokes  equations  are: 

"*.*=0  (7) 

u»+(u,uk).t  =-P.(  +  £V*  (W 

Following  the  ideas  of  Gennano,  the  moment  of  (8)  is  taken  with  u} }  another  moment  of  this 
same  equation  with  indices  interchanged  yields  (9)  and  (10). 

V'„  +  uj (M.u*  h  ~  ~P.'Ui  +  (9) 

»,«iJ  +  «l («/«J  ).t  =  +  WfO'yi.*  (  1  0) 

Adding  together  (9)  and  (10)  and  using  a  little  calculus,  it  can  be  shown  that  the  resulting 
equation  can  be  written  as: 

(«,«, ),,  +  ("."AX*  =-\_Pu,5,k  +  Pu At  ~v(uiuj) +2/«®- 2^**0.*  (H) 


if  a  Newtonian  fluid  ov  =  vu,  l  is  assumed.  Recall  that  the  turbulent  stress  tensor  was  previously 
defined  as  Ri;  =  utut  -w,»y  ,  the  double  bracket  was  <  anbj  >  =  apj  -aib/ ,  and  the  turbulent 
transport  term  was  defined  by  Tlji&uiuJut-ulRjt—UjRjt—ukRg—uiUjUk  .  Simply  averaging 

equations  (7),  (8),  and  (1 1)  and  substituting  for  the  stress  tensor,  double  bracket,  and  turbulent 
transport  will  recover  the  equations  below: 

iu.t  =  0  (12) 


uu  +  (uiui ) ,  =  -p,  +  vu,j.,  -  R,j  j  ( 1 3) 

Pi), I  Pi/.t  "  vR'jH  —  (Rjtuik  +  R:iu)  k ) 

-  <  T„k  >.*  “(<  P,P*j  >  +  <P.,>li<  >)~2v<  uit ,hm  > 

Hence  it  is  shown  that  the  averaging  invariance  procedure  of  Gcrmano  docs  indeed  provide 
evolution  equations  that  are  exactly  the  same  as  the  Reynolds  stress  transport  (RST)  equations. 


B,  CPU  Neighbors 

Subroutine  CPOneighf) 

Implicit  None 

Integer  ;;  iz,iy,ixf  tmp 
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I l location  in  CPU  grid 
iz  =r  my  id/  (NXP*NYF)  +  1 
tmp  -  myid- ( iz- 1 ) *NXP*NYP 
iy  =  tmp/NXP  +  1 
ix  =  tmp  -(iy-l)*NXP  +  1 


! Compute  neighboring  CPU  locations  below 

CPUinfo ( 1 )  =  myid+1  Jx-dir  right 

if  (  ix  ==  NXP)  CPU  inf o(l)  =  CPUinfoU)  -  NXP 


CPU  info (2)  =  myid-i  lx- dir  left 

if  (  ix  ==  1  }  CPUinf o (2)  »  CPU_info(2)  +  NXP 


CPU  inf o { 3 )  *  myid+NXP  [y-dir  top 

if  (  iy  NYP}  CPU  infoO)  «  CPU_info{3)  -  NXP*NYP 

CPU_inf o ( 4 )  s  my id -NXP  !y-dir  bottom 

if  (  iy  »  1  )  CPU  info ( 4 )  =  CPU_info{4)  +  NXP*NYP 

CPU_info(S)  =  myid+NXP*NYF  !z-dir  back 

if  (  iz  NZP)  CPU_info{5)  -  CPU_info(5)  -  NXP*NYP*NZP 


CFU_info(6)  =  myid-NXF*NYP  Iz-dir  front 

if  (  iz  ==  1  )  CPU_info(6)  -  CPU_info(6)  +  NXP*NYP*NZP 


End  Subroutine  CPU  neigh 

Integer  'myid1,  is  a  variable  MPI  assigns  to  each  processor  in  use.  If  eight  processors  are  used  in 
parallel  'myid'  would  range  from  0  -  7,  The  variables  NXP,  NYP,  and  NZP  correspond  to  the 
number  of  processors  in  each  direction  (x,y,z),  The  calculated  integer,  CPUinfo  1,  is  the  CPU 
located  to  the  right  of  the  processor  in  question.  CPUinfo2  computes  the  CPU  to  the  left  of  the 
processor  in  question,  CPUinfo3  and  CPUinfo4  correspond  to  the  top  and  bottom  CPUs 
respectively.  Lastly,  CPUinfoS  and  CPUinfob  are  the  back  and  front  CPUs  located  relative  to  the 
CPU  in  question.  Further  clarification  by  (top,  bottom,  etc,)  was  dearly  shown  in  Figure  16. 


C.  Periodic  Boundary  Condition  Data  Transfer 

!  Nonblocking  MPI  send  x 
Subroutine  SendBCX { NX , NY ( NZ  f var ) 

Implicit  None 
Integer  ::  NX,NY,NZ 

Real,  Intent (IN)  ::  var { 0 : NX  + 1 , 0 : NY +1,0: NZ  + 1 ) 
Integer  ::  num,  tagl,tag2,  cpul,cpu2 
Real  ::  tmpl , tmp2 

num  =  ( NY  +  2 ) * (NZ  +  2) 
cpul  ^  CPU  inf o(l) 
cpu2  =  CPU_info(2) 
tagl  =  tag+1 
tag2  a  tag+2 

’Sending  right  y-z  plane 

sbuffx^r (O-NY+1, O-NZ+1)  =  var (NX, 0 : NY+1 , 0 ;NZ+1) 
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call  mpi_isend  ( sbuf f x_r , num, datasize , cpul ,  tagl , MPICOMMWORLD, 
handle_sx (1 ) , ierr} 

l Sending  left  y-z  plane 

sbuf f X_l (0 :NY+1, 0 : NZ4l)  ^  var ( 1 , 0 ; NY+1 , 0 :NZ  +  1 ) 

cal 1  mpiisend  (sbuffxj , num, datasize , cpu2 , tag2, MPI_COMM_WORLD, 
handlesx ( 2 } , ierr) 

! Receiving  right  y-z  plane 

call  mpi_irecv  (rbuffx^r, num, datasize, cpul , tag2 , HP  I  COMM  WORLD, 
handle_rx ( 1 ) , ierr) 

[Receiving  left  y-z  plane 

call  mpiirecv  ( rbuffxl , num, datasize ( cpu2 , tagl , MPICOMMWORLD, 
handle_rx { 2 ) , ierr) 

End  Subroutine  SendBCX 

****************************** ***+***-*************+******************** 

!  Nonblocking  MPI  recieve  -  x 

Subrout ine  RecvBCX (NX , NY , NZ, var) 

Implicit  None 

Integer  ::  NX,NY,N2  ,  i 

Real  ::  var ( 0 : NX+ 1 , 0 : NY+ 1 , 0 : NZ+ 1 } 


i  wait  for  recv  and  send  to  complete 

call  mpi_wai tall (2 , handle_sx, statuss , ierr)  (wait  for  sends 
call  mpiwaltall (2 , handle  rx , status_r, ierr)  Iwait  for  recvs 

I  from  the  left  (high  ft) 

(move  data  values  from  buffers  into  ghost  cells 
var {  0 ( 0 : NY  +  1 , 0 : NZ  +  1 )  =  rbuffx_l (0 : NY+1 , 0 : NZ  +  1) 

(from  the  right  CPU  (low  ft) 

(move  data  values  from  buffers  into  ghost  cells 
var (NX+1, 0:NY+ 1,0: NZ+1)  -  rbuf f X_r { 0 : NY+1 , 0iN2+l) 

End  Subroutine  Recv  BCX 


D,  Typical  MPI  functions  Used 

Great  detail  explaining  MPI  function  calls  can  be  found  at:  hltp;//www-umx jncs.anl.gov/ 
Below  are  a  selection  of  the  main  function  calls  used  in  this  project. 


****-******#*********#***************+***********************+********** 

MPI__Isend(  void  *bu£,  int  count,  MPIDatatype  datatype,  int  dest,  int  tag, 
MPI_Comtn  comm,  MPI_Request  *  request  ) 

MPIIrecvt  void  *buf,  int  count,  MPIDatatype  datatype,  int  source, 
int  tag,  MPTComm  comm,  MPI_Request  * request  ) 


Input  Parameters: 

buf ;  initial  address  of  send/recieve  buffer  (choice). 
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count : 
datatype : 
source : 
dest : 
tag: 
comm : 


number  of  elements  in  send/receive  buffer  [integer) . 

datatype  of  each  send/receive  buffer  element  (handle) . 

rank  of  source  (integer), 

rank  of  destination  (integer). 

message  tag  (integer). 

communicator  (handle) . 


Output  Parameter 

request:  communication  request  (handle) 

All  HPI  routines  in  Fortran  have  an  additional  argument  ierr  at  the  end  of 
the  argument  list, 

*****  *************  ******  *  **************************  ******************** 

MPISend {  void  *bu£,  int  count,  MPIDatatype  datatype,  int  dest, 
int  tag,  MPI_Comm  comm  ) 

MPIRecvf  void  *buf,  int  count,  MPI_Datatype  datatype,  int  source, 
int  tag,  HPI_Comm  comm,  MPI_Status  ^status  } 


Input  Parameters: 

but:  initial  address  of  send/receive  buffer  (choice) 

count:  number  of  elements  in  send  buffer  (nonnegative  integer) 

datatype:  datatype  of  each  send/receive  buffer  element  (handle) 

dest:  rank  of  destination  (integer) 

tag:  message  tag  (integer) 

comm:  communicator  [handle) 


Output  Parameters : 

but :  initial  address  of  receive  buffer  (choice) 

status:  status  object  (Status) 

*********************************************************************** 

MPIWaitall ( 

int  count, 

MPIRequest  a r ray  of  requests  □  , 

MPlStatus  arrayof  statuses []  ) 

Input  Parameters: 

count:  lists  length  (integer). 

array_of_requests :  array  of  requests  (array  of  handles) 

Output  Parameter:  array_of_statuses 

array  of  status  objects  (array  of  Status) .  Hay  be  HPI _STATUSES_I CHORE 
***************************** **+*****+********+***********+*******+**** 

MPI_Al 1 reduce  f  void  *sendbuf,  void  *recvbuf,  int  count, 

MPIDatatype  datatype,  MPI  Op  op,  MPI_Comm  comm  ) 


I  nput  Pa  r atne  t  e  r  s  : 


sendbuf : 
count : 
datatype : 

op: 
comm ; 


starting  address  of  send  buffer  (choice) 
number  of  elements  in  send  buffer  (integer) 
data  type  of  elements  of  send  buffer  (handle) 
operation  (handle) 
communicator  (handle) 


Output  Parameter: 
recvbuf : 

starting  address  of  receive  buffer  (choice) 
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E*  Global  Values 


Real  Function  pSUM(var) 

Implicit  None 

Real  : :  num,  var ( ; , z ,  : ) 

Real  : :  data_G 

num  =  SUM (var) 

if  (numprocs  >  1)  then 

Call  MP I _A1 1 reduce (num, data_G( 1 , datasi ze( MPI_SUM, MPICOMMWQRLD, 
ierr)  ! MPIREAL 
pSUM  =  dataG 
else 

pSUM  -  num 
end  if 

End  Function  pSUM 


*****#*4**********4*****4*****  ********4******  ************** +  *  +  #+  +  *#•*** 

Real  Function  pMAXVAL. { var ) 

Implicit  None 

Real  i z  num,  var ( i ,  r , ; ) 

Real  : :  data_G 

num  =  MAXVAL (var) 

if  (numprocs  >  1)  then 

Call  MPI  Al lreduce {num, data  G f I , datasi ze , MPIMAX , MPI_COMM_WQRLD , 

ierr)  IMPI_REAL 
pMAXVAL  =  dataG 
else 

pMAXVAL  =  num 
end  i  f 

End  Function  pMAXVAL 


**************************************************************  ********* 

Real  Function  pMINVAL { var) 

Implicit  None 
Real  i:  num,  var { : , i , : 1 
Real  : :  dataG 

num  =  MINVAL { var) 

if  (numprocs  >  1)  then 

Call  M  PI  _A 1 1 reduce (num, data  G, 1 , datasi ze, MPI  MIN , MPICOMMWQRLD, 

ierr)  IMPIREAL 
pMINVAL  =  data_G 
else 

pMINVAL  =  num 
end  if 

End  Function  pMINVAL 
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F,  Data  Output 


Subroutine  TKEOUT (dir) 

Implicit  None 

Real  ::  U J ( 0 : DNX  + 1,0: DNY  +  1,0: DNZ  + 1 } 

Real , dimension ( 0 : NX+1 , 0 : NY+1 , 0 :NZ+1 }  ::  rbuff 

Real  intsend { 1 : 3 ) , intrecv ( 1 : 3 } 

Integer  : :  sendcount , recvcount , root , tag, tag2 , ierr, ss (MPI  STATUSSIZE) 

i (NXP,  NYP,  NZP,  il,  jl  ,kl)  are  computed  exactly  as  in  input 
root  =  0  ! assign  root  processor  to  be  myid  -  0. 
recvcount  =  ( (NX+2) * (NY+2) * (NZ+2) )  (size  of  data  to  be  sent 
sendcount  =  recvcount  l size  of  data  to  be  received* 

If  (myid  «  0)  Then 

! Store  root  processor  data  section  into  larger  array  (uJ) . 

uJ(  NX* (il-1)  :  (il*NX) +1  ,  NY* { j 1- 1 J :  ( j 1*NY>  +  1 ,  NZ* {kl -1)  :  (kl*NZ) +1  )  » 

tke (0 : NX  +  1 ,  0 : NY  + 1 ,  0:NZ+1) 

l After  storing  root  information,  receive  the  values  (myid,  ilf  jl,  kl) 
Ifrom  all  other  processors 

do  i  ^  1,  numprocs-1  ! loop  over  all  processors  (exept  root) 

INow  receive  data  from  processor  "i" 

Call  MPIRECV  ( rbuf  f  ,  recvcount ,  datasize,  i , tag,  MPI_CQMM_WORLD,  ss,  ierr ) 

INow  receive  (i!,jk,kl)  from  the  processor  that  just  sent  data 
Call  MPI  RECV ( intrecv, 3 , datasi ze , i , tag2 , MPICOMM_ WORLD, ss , ierr) 

I  with  myid  known,  along  with  (il,jl,kl)  reconstruct  the  large  array  (uJ) 
!from  the  receive  buffer* 

uJ(  NX* (il-1) : (il*NX}+l  ,  NY* { jl-1) : [ jl*NY) +1,  NZ* {kl - 1 ) : ( kl *NZ ) + 1  ) 

=  rbuf f (0:NX+1,  0 : NY  +  1 ,  0:NZ+1) 

end  do  ! stops  loop  on  root 

! ! I ! !  now  write  out  total  domain  to  binary  file, 
open  (unit  =  500 , file  =  trim(strl)  , form^  r hi nary ' ) 
write (500)  ( ( (uJ(i, j ,k) ,i=l,DNX) ,  j=l , DNY) , k=l,DNZ) 

close (unit =500 ) 

Else 

Ilf  processor  is  not  root,  then  send  your  tke  data  to  root. 

Call  MPI^SEND (tke, sendcount, datasi ze, root , tag, MPI_COMM_WORLD , ierr) 

Istore  il,jl,kl  in  an  array  called  tintsend) 
intsend(l)  *  real(il) 
intsend(2)  =  real(jl) 
intsend(3)  =  real(kl) 

INow  send  values  of  (il,jl,kl)  to  root. 

Call  M  P I _SEND ( i nt send , 3,datasize, root , tag2 , MPI_CQMM_WORLD, ierr) 

End  If 
Return 

End  Subroutine  TKE  OUT 
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